KV-Cache Hit Rate: Why the Context Window Is Memory You Pay Rent On

Start with the number, because the number is the argument. The team behind the Manus agent put it plainly: "the KV-cache hit rate is the single most important metric for a production-stage AI agent" (Manus blog, 2025-07-18). Not accuracy. Not latency. A cache metric. And the reason is economic: an agent reads far more than it writes, the read is mostly the same tokens turn after turn, and every major provider now prices a re-read of those tokens at roughly a tenth of a fresh one. The context window is not free storage. It is memory you pay rent on — and the rent has a published rate card.

TL;DR

  • A cache metric is the top agent metric because agents run ~100:1 input-to-output tokens — the reused prefix dominates the bill (Manus).
  • A cache read is steeply discounted at every major provider: ~10% of base at Anthropic (0.1x) and DeepSeek (~1/10), ~25% of base at Gemini 2.5+ (75% off), as low as ~10% of base at OpenAI (up to 90% off), and ~50% of base at Groq. The cross-provider convergence is the hard evidence.
  • A cache hit needs an exact, unbroken prefix from token 0 — so the discipline is append-only context with a stable prefix. That is memory management, not prompt tweaking.
  • Context engineering manages the working-memory tier; persistent memory is the long-term tier that feeds it. Calling them one thing is an argued position — now mainstream framing, not a settled fact.

The economics first: every provider discounts a cache read

The cleanest evidence for the thesis is a pricing table, and it is the same shape at every vendor. Cache reads are cheap; fresh input is not; and the providers, working independently, landed on the same order of magnitude.

| Provider | Mechanism | Cache-read cost | Trigger | Source |
|---|---|---|---|---|
| Anthropic | Prompt caching (explicit breakpoints) | 0.1x base input (writes 1.25x / 2x) | per-model min (1,024 tok for current Sonnet/Opus) | docs |
| Google Gemini 2.5+ | Implicit caching (on by default) | 25% of base (75% off) | automatic | Google blog |
| DeepSeek | On-disk context caching | ~1/10 of standard input | auto, exact-prefix from token 0 | DeepSeek |
| OpenAI | Automatic prefix caching | up to 90% off cached input | automatic, >1,024 tokens | OpenAI |
| Groq | Prompt caching | ~50% off cached input | auto, exact-prefix match | Groq docs |

Read it as a CFO would. Anthropic and DeepSeek price a re-read at ~10% of a fresh read, and OpenAI now lists up to 90% off; Gemini cuts it ~75%; Groq roughly halves it. The exact depth differs, but the direction does not. Independent companies do not converge on discounting a re-read by accident — they converge because they are all pricing the same underlying fact: re-serving an unchanged prefix is nearly free for them, so they pass the saving on. The Manus team's worked example sits at the deep end of this band: at the Claude Sonnet rates current when they published, cached input ran $0.30/MTok against $3/MTok uncached — a 10x gap (Manus). That "$3 → $0.30" line is a dated Sonnet example, not today's universal price; base rates have since changed and differ per model. The structural fact that holds across time and vendors is the one that matters: a cache read costs a fraction of a fresh read — as little as a tenth.

Now multiply by the agent's workload. Manus reports an average input-to-output token ratio "around 100:1" (Manus) — the model reads a hundred tokens for every one it emits. When the overwhelming majority of spend is on the read side, and the read is the same long prefix you sent last turn, the single highest-leverage move is to make that prefix cacheable. That is why a cache metric outranks every other agent metric. The hit rate is not a tuning knob. It is the meter on the working memory.
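The arithmetic can be sketched in a few lines. This is a simplified model, not a billing calculator: it plugs in the dated Sonnet example above ($3/MTok base, 0.1x cache read), ignores cache-write surcharges and output tokens, and the function name and workload size are illustrative assumptions.

```python
def input_cost_usd(input_tokens, hit_rate, base_per_mtok, read_multiplier):
    """Input-side cost when `hit_rate` of the input tokens are cache reads.

    Simplification: ignores cache-write surcharges and output tokens.
    """
    cached = input_tokens * hit_rate
    fresh = input_tokens - cached
    return (fresh + cached * read_multiplier) * base_per_mtok / 1_000_000


# Hypothetical workload: 50 turns, each re-sending a 100,000-token context.
tokens = 50 * 100_000
for rate in (0.0, 0.5, 0.9):
    cost = input_cost_usd(tokens, rate, base_per_mtok=3.00, read_multiplier=0.1)
    print(f"hit rate {rate:.0%}: ${cost:.2f}")
```

Under these assumptions the same five million input tokens cost $15.00 with no cache hits and $2.85 at a 90% hit rate.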

What the KV-cache is, and why a hit needs an exact prefix

The KV-cache is the set of stored key and value tensors for tokens the model has already processed, kept so attention does not recompute them on the next decoding step.

The mechanism falls straight out of the attention definition. A transformer computes softmax(QK^T/√d_k)V, and under the causal mask used in decoding, each token's key and value depend only on that token and the ones before it, never on tokens that come later (Vaswani et al., 2017). So once you have computed K and V for a position, they are valid forever; you can cache them and reuse them. The cache removes the redundant recomputation of K and V for earlier positions. It does not remove attention itself: each new token still attends over every cached position, so total attention work over a T-token sequence still grows as O(T²), which is why long context stays expensive even with a perfect cache.
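A toy single-head, single-layer sketch makes the reuse concrete. It is an illustration under stated assumptions, not any library's implementation: the dimensions, random weights, and helper names (`attend`, `project`, `step_with_cache`) are all made up, and plain Python lists stand in for tensors.

```python
import math
import random


def attend(q, keys, values):
    """Single-head attention for one query over the given keys and values."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


def project(x, w):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(xi * wi for xi, wi in zip(x, row)) for row in w]


random.seed(0)
d = 4


def rand_matrix():
    return [[random.uniform(-1, 1) for _ in range(d)] for _ in range(d)]


w_q, w_k, w_v = rand_matrix(), rand_matrix(), rand_matrix()
tokens = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(6)]


def step_no_cache(t):
    """Recompute K and V for the whole prefix at every step."""
    keys = [project(x, w_k) for x in tokens[: t + 1]]
    values = [project(x, w_v) for x in tokens[: t + 1]]
    return attend(project(tokens[t], w_q), keys, values)


k_cache, v_cache = [], []


def step_with_cache(t):
    """Project only the new token, append its K and V, reuse the rest."""
    k_cache.append(project(tokens[t], w_k))
    v_cache.append(project(tokens[t], w_v))
    return attend(project(tokens[t], w_q), k_cache, v_cache)


for t in range(len(tokens)):
    cached, full = step_with_cache(t), step_no_cache(t)
    assert all(math.isclose(a, b) for a, b in zip(cached, full))
```

Both paths give the same output at every step. The cached path does one key and one value projection per step instead of re-projecting the whole prefix, but `attend` still loops over every cached position.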

That permanence comes with a hard condition. The keys and values for position n were computed in the presence of every token before n. Change any earlier token and those tensors are no longer valid — the cache is dead from the change point onward.

KV-cache hit rate is the fraction of input tokens served from cache rather than recomputed; it rises when the prefix is stable and falls the moment an early token changes.

This is why every provider's caching guidance says the same thing: put static content first (system prompt, tools, examples), put variable content last (the user's latest message, timestamps), and never mutate what came before. OpenAI, Groq, and DeepSeek all require an exact prefix match from token 0 (OpenAI; Groq; DeepSeek). A single injected timestamp near the top of the prompt resets the meter to zero.

Prefix caching is reusing the stored KV state for an unbroken run of tokens shared with a previous request, valid only up to the first token that differs.
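A minimal sketch of that rule, assuming whitespace-split words as stand-ins for real tokenizer output. The function names are hypothetical, and real providers add minimum lengths and block granularity on top of this.

```python
def shared_prefix_len(previous, current):
    """Number of leading tokens that are identical in both requests."""
    n = 0
    for a, b in zip(previous, current):
        if a != b:
            break
        n += 1
    return n


def hit_rate(previous, current):
    """Fraction of the current request's tokens reusable from the previous one."""
    return shared_prefix_len(previous, current) / len(current) if current else 0.0


# Whitespace-split words stand in for real tokens (an assumption).
system = "You are a helpful agent . Tools : search , read_file .".split()
turn_1 = system + "User : find the report".split()
turn_2 = turn_1 + "Tool : found report.pdf User : summarize it".split()
print(hit_rate(turn_1, turn_2))  # append-only: all of turn_1 is reusable

stamped_1 = ["12:00:01"] + turn_1
stamped_2 = ["12:00:09"] + turn_2
print(hit_rate(stamped_1, stamped_2))  # first token differs: nothing is reusable
```

The second pair shares almost all of its content with the first, yet the differing leading timestamp leaves a shared prefix of length zero.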

Two design rules follow, and they are not tuning preferences — they are the definition of a working memory: keep the opening of the context byte-identical across turns (no timestamps, no reordered tool definitions, no nondeterministic JSON key ordering), and add only to the end. Manus states it directly: "Make your context append-only. Avoid modifying previous actions or observations" (Manus). Append-only is exactly how a working memory behaves — you accumulate, you do not rewrite history.
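The JSON key-ordering hazard is easy to reproduce with the standard library. The sketch below shows one possible canonical form (sorted keys, fixed separators); the `canonical` helper and the tool definition are hypothetical.

```python
import json


def canonical(obj):
    """Serialize with sorted keys and fixed separators so equal data gives equal text."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


# The same hypothetical tool definition, built with keys in different orders.
tool_a = {"name": "search", "parameters": {"query": "string", "limit": "integer"}}
tool_b = {"parameters": {"limit": "integer", "query": "string"}, "name": "search"}

assert json.dumps(tool_a) != json.dumps(tool_b)  # default output follows insertion order
assert canonical(tool_a) == canonical(tool_b)    # canonical output is identical
```

Any deterministic scheme works, as long as every code path that builds the prefix uses the same one.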

One common error is worth correcting here: the serving systems do not all use the same trick. vLLM uses hash-based automatic prefix caching; the radix-tree mechanism people sometimes attribute to it is SGLang's RadixAttention, a different system. The interface fact, though, is identical everywhere: identical prefix, cheap; changed prefix, full price.
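To show what "hash-based" can mean, here is a generic sketch of chained block hashing. It is an illustration of the idea only, not vLLM's actual code: the block size, the SHA-256 choice, and both function names are assumptions.

```python
import hashlib


def block_hashes(tokens, block_size=4):
    """Hash each full block together with the hash of the block before it."""
    hashes, parent = [], b""
    for start in range(0, len(tokens) - block_size + 1, block_size):
        block = tokens[start:start + block_size]
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(digest)
        parent = digest
    return hashes


def reusable_blocks(cached, tokens, block_size=4):
    """Count leading blocks of `tokens` whose chained hash is already cached."""
    count = 0
    for h in block_hashes(tokens, block_size):
        if h not in cached:
            break
        count += 1
    return count


first = list(range(12))  # 12 token ids, so 3 full blocks
cache = set(block_hashes(first))

same_prefix = list(range(8)) + [99, 98, 97, 96]
changed_early = [42] + list(range(1, 12))
assert reusable_blocks(cache, same_prefix) == 2    # first two blocks match
assert reusable_blocks(cache, changed_early) == 0  # token 0 differs, chain breaks
```

Because each hash folds in its parent, a block can only match when everything before it matched too, which is the exact-prefix rule expressed as a lookup key.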

The serving layer: how the cache got cheap enough to price

The economics above only exist because the systems layer made KV-cache reuse efficient. PagedAttention, the algorithm behind vLLM, applies OS-style paging to the KV-cache, achieving near-zero KV-cache memory waste and 2–4x throughput over the prior state of the art at the same latency — and, critically, it enables prefix sharing across requests (Kwon et al., 2023). ("Near-zero waste" is the paper's own claim; the larger waste figures sometimes quoted are secondary paraphrases, not a verbatim result.) Underneath that, FlashAttention made exact attention IO-aware with tiling — up to 3x on GPT-2 with longer context (Dao et al., 2022) — and FlashAttention-2 roughly doubled it again, up to 225 TFLOPs/s on an A100 (Dao, 2023). Prompt Cache precomputes reusable "prompt modules" to cut time-to-first-token — up to 8x on GPU, up to 60x on CPU (Gim et al., 2023).

The model architecture itself also learned to carry less KV. One compact table covers the field — each row trades a little quality for a smaller cache that lets a long stable prefix stay resident and reusable:

| Technique | What it does | KV-cache effect | Source |
|---|---|---|---|
| MQA | Single shared K/V head (the original recipe) | Large reduction, some quality cost | Shazeer, 2019 |
| GQA | Interpolates MHA↔MQA; uptrains from MHA at ~5% compute | Reduction set by heads/groups | Ainslie et al., 2023 |
| CLA | Shares K/V across adjacent layers | ~2x further over MQA, comparable accuracy | Brandon et al., 2024 |
| MLA | Compresses K/V into a low-rank latent | Shipped in frontier (DeepSeek-V2) models | DeepSeek-AI, 2024 |

(GQA's often-quoted "8x" is an illustrative heads-to-groups example, not a headline.) The point is not the catalogue. An entire stack — kernels, paging, attention variants — exists to make one thing affordable: keeping a stable prefix cheap to re-read. The market built a memory tier, then put a price on it.
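The "reduction set by heads/groups" row is simple arithmetic. The sketch below uses a hypothetical model shape picked for round numbers (32 layers, 32 query heads, head dimension 128, 16-bit values), not any specific released model.

```python
def kv_cache_bytes(seq_len, layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes of K and V stored for one sequence (the factor 2 is K plus V)."""
    return 2 * seq_len * layers * kv_heads * head_dim * bytes_per_value


# Hypothetical model shape, chosen only for round numbers.
shape = dict(seq_len=32_768, layers=32, head_dim=128)
mha = kv_cache_bytes(kv_heads=32, **shape)  # one K/V head per query head
gqa = kv_cache_bytes(kv_heads=8, **shape)   # 32 query heads share 8 K/V heads
mqa = kv_cache_bytes(kv_heads=1, **shape)   # all query heads share one K/V head

for name, size in (("MHA", mha), ("GQA", gqa), ("MQA", mqa)):
    print(f"{name}: {size / 2**30:.2f} GiB")
```

For this shape the cache shrinks from 16 GiB to 4 GiB to 0.5 GiB per 32K-token sequence. The ratio is the number of query heads divided by the number of K/V heads.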

A KV-cache hit rate playbook for context engineering

Once you accept that the window is a finite, append-only, rented memory, the named best practices stop looking like prompt tricks and start looking like memory operations. Each move below has one job — keep the longest possible prefix byte-for-byte identical across turns — and one cited number.

1. Build one stable prompt prefix. A cache hit needs an exact, unbroken prefix from token 0, so put everything invariant — system prompt, tool definitions, persistent instructions — at the very front and never touch it. The classic own-goal is a timestamp in the system prompt: it changes every second and invalidates the cache on every call (Manus). One practitioner measured the payoff directly: a stable prefix ran at ~2,258 ms versus ~3,714 ms when perturbed — about a 39% latency improvement on average (Ankit Sinha, 2025, practitioner-measured).

2. Make the context append-only. Never mutate earlier actions or observations; only add to the end. Any edit to a cached segment invalidates the cache from the change point onward — and non-deterministic JSON key ordering silently breaks the prefix even when the content is identical. The fix is canonicalization: serialize the prefix deterministically — stable key ordering, fixed formatting, no volatile fields — so the prompt cache key (the exact token sequence from position 0 that the provider hashes to find a hit) is byte-for-byte identical across turns (Manus). This is the defining property of a working memory: you write new entries, you do not rewrite history.

3. Mask tools, don't remove them. Tool definitions live near the front of the context; adding or dropping them per step destroys the prefix cache. Manus's answer is mask, don't remove — keep all tools in the stable prefix and constrain the action space by masking token logits during decoding (Manus). The cache survives; the agent still only picks from the allowed set.

4. Place dynamic content last, behind cache breakpoints. Every provider that documents caching says static content first, variable content (user query, fresh tool results) last (OpenAI; Groq). The empirical caveat is sharp: "Don't Break the Cache" measured prompt caching across OpenAI, Anthropic, and Google on long-horizon agent tasks and found that dynamic content injected into the cached region, or dynamic tool definitions, kills hits and can paradoxically raise latency (Lumer et al., Jan 2026). Keep the churn at the tail.

5. Externalize to files and persistent memory. The window is finite; durable knowledge does not belong inside it. Manus treats the file system as "the ultimate context… unlimited in size, persistent by nature, and directly operable" — explicitly externalized memory (Manus). Anthropic frames the same move as just-in-time retrieval: keep lightweight identifiers in context, load the payload at runtime (Anthropic, 2025-09-29). LangChain's Write/Select/Compress/Isolate taxonomy gives the verbs (Martin, 2025-06): write it out, select it back in.

6. Compact, don't truncate. Near the limit, do not slice off the oldest tokens — that breaks the cache and loses information. Anthropic's compaction summarizes the context and reinitializes from the compressed summary, deliberately preserving architectural decisions and unresolved bugs; sub-agents are the same idea at a larger grain, returning a condensed summary to the main thread (Anthropic).

7. Recite the goal to keep it in attention. Long contexts rot. Chroma's "Context Rot" report tested 18 frontier models and found recall grows increasingly unreliable as input length grows — before the window is full, even on simple tasks (Hong et al., 2025-07-14). Manus's countermove is recitation — continuously rewriting a todo.md-style summary at the end of the context so the objective stays in recent attention (Manus). Recitation appends; it does not mutate the prefix, so moves 2 and 7 cooperate.

8. Measure the hit rate. You cannot optimize what you do not instrument. Most APIs report cached versus uncached input tokens per request — turn that into a hit-rate gauge and watch it move as you apply the moves above. It is the leading indicator for both the cost line and the latency line, which is the entire reason Manus elevates it to the #1 metric.
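Move 8 can be sketched as a pair of small helpers over the usage object each API returns. The field names follow the OpenAI and Anthropic usage formats as I understand them and should be checked against current provider docs; the payloads are made up.

```python
def openai_hit_rate(usage):
    """Hit rate from an OpenAI-style usage dict, where prompt_tokens includes cached tokens."""
    total = usage.get("prompt_tokens", 0)
    cached = (usage.get("prompt_tokens_details") or {}).get("cached_tokens", 0)
    return cached / total if total else 0.0


def anthropic_hit_rate(usage):
    """Hit rate from an Anthropic-style usage dict, where the three input counts are disjoint."""
    read = usage.get("cache_read_input_tokens", 0)
    written = usage.get("cache_creation_input_tokens", 0)
    fresh = usage.get("input_tokens", 0)
    total = read + written + fresh
    return read / total if total else 0.0


# Made-up usage payloads for illustration.
print(openai_hit_rate({"prompt_tokens": 2000, "prompt_tokens_details": {"cached_tokens": 1536}}))
print(anthropic_hit_rate({"input_tokens": 200, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 1800}))
```

Logged per request and averaged per session, this gives the gauge the playbook asks for. A sudden drop usually points at something that changed near the front of the prompt.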

These moves are memory operations by another name. Context engineering is, in Anthropic's definition, "the set of strategies for curating and maintaining the optimal set of tokens (information) during LLM inference" — distinct from prompt engineering, which is single-shot instruction writing (Anthropic). Anthropic is explicit that the resource is finite: "Context, therefore, must be treated as a finite resource with diminishing marginal returns." A finite resource that degrades as you fill it is the definition of something you must manage, not just expand.
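Move 3 above (mask, don't remove) is the least familiar of these operations, so here is a toy version. It is a sketch under assumptions, not Manus's implementation: real decoders mask token-level logits, while this example scores whole tool names, and the tool names and scores are invented.

```python
import math


def mask_logits(logits, allowed):
    """Set the logit of every disallowed action to -inf; leave the rest unchanged."""
    return {name: (score if name in allowed else -math.inf) for name, score in logits.items()}


def softmax(logits):
    """Convert logits to probabilities; assumes at least one finite logit."""
    top = max(logits.values())
    exps = {name: math.exp(score - top) for name, score in logits.items()}
    total = sum(exps.values())
    return {name: value / total for name, value in exps.items()}


# Hypothetical per-tool scores; a real decoder masks token-level logits.
logits = {"browser_open": 2.1, "shell_exec": 3.4, "file_read": 1.7}
probs = softmax(mask_logits(logits, allowed={"browser_open", "file_read"}))

assert probs["shell_exec"] == 0.0
assert max(probs, key=probs.get) == "browser_open"
```

The tool list in the prompt never changes, so the prefix stays cacheable. Only the decoding-time mask varies from step to step.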

The honest bridge: working memory vs long-term memory

Here is the careful version. Calling context engineering "memory engineering" is partly my own synthesis and partly already mainstream: Gartner declared 2026 "the Year of Context," so the framing is widely shared. But "mainstream framing" means the topic is mainstream, not that the equation is settled: no primary source proves the two are identical, and the literature is split on direction. "A Survey of Context Engineering" treats memory as a component of context engineering (Mei et al., 2025), while the "Memory in the Age of AI Agents" survey (arXiv 2512.13564, Dec 2025; v2 Jan 2026) deliberately separates agent memory from context engineering. So I present the equation as an argued position, not a sourced fact. I will not flatten that tension.

The distinction that survives it is the useful one. The context window is working memory — finite, fast, in-attention, and (as context rot shows) capacity-limited, much like the working memory Baddeley and Hitch described, though that cognitive analogy is illustrative, not load-bearing. Durable knowledge belongs in a long-term tier outside the window, paged in on demand. MemGPT made the analogy concrete with OS-style tiering — core memory in-context as RAM, recall and archival memory searchable as disk (Packer et al., 2023). Manus's "file system as context" and Anthropic's "just-in-time retrieval" are the same architecture under different names: keep the working set small and stable; store the rest; fetch when needed.
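The shared architecture fits in a few lines. This is a minimal sketch assuming a file-per-key store; the `FileMemory` class and the `[memory:key]` reference format are hypothetical and are not the API of MemGPT, Manus, or Anthropic.

```python
import tempfile
from pathlib import Path


class FileMemory:
    """Long-term tier: payloads live on disk, only short references enter the context."""

    def __init__(self, root):
        self.root = Path(root)

    def write(self, key, text):
        (self.root / f"{key}.txt").write_text(text, encoding="utf-8")
        return f"[memory:{key}]"  # lightweight identifier kept in the window

    def read(self, key):
        return (self.root / f"{key}.txt").read_text(encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    memory = FileMemory(tmp)
    context = ["system: you are a research agent"]

    page = "long scraped page " * 500
    context.append(memory.write("report_page", page))  # the window holds only the reference

    assert len(context[-1]) < 30
    assert memory.read("report_page") == page  # the payload is fetched on demand
```

The working set grows by a short reference, appended at the tail, while the 9,000-character payload stays outside the window until a step actually needs it.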

This is the modest, earned place for Mnemoverse. Cache-aware context design and a persistent memory engine are the same problem at two scales — one manages the rented working tier, the other is the durable long-term tier that feeds it. Get the working tier wrong and you pay rent for noise; get the long-term tier wrong and the agent starts cold every session. They compose; neither replaces the other. Whether the memory you recall is actually right is a separate measurement — which is where the evaluation work comes in.

Common questions

What is the KV-cache hit rate for AI agents? It is the share of your input tokens served from cache instead of recomputed. The Manus team calls it "the single most important metric for a production-stage AI agent," because an agent's input-to-output token ratio runs about 100:1 — so the reused prefix dominates cost and latency.

How much does prompt caching cost across providers? Cache reads are priced far below fresh input. Anthropic charges 0.1x base input for a cache read; DeepSeek charges about 1/10; Gemini 2.5+ gives a 75% discount (cached = 25% of base); OpenAI lists up to 90% off cached input, and Groq is roughly 50% off. Every major provider discounts a re-read — the cross-provider convergence is the signal.

What is context engineering for AI agents? Anthropic defines it as "the set of strategies for curating and maintaining the optimal set of tokens (information) during LLM inference" — distinct from prompt engineering, which is single-shot instruction writing. In practice it is managing the context window as a finite working memory.

Why does mutating early context destroy the KV-cache? A cache hit requires an exact, unbroken prefix from token 0. Any change to a cached prefix invalidates everything from the change point onward. So the disciplined pattern is append-only context with a stable prefix — the defining property of a working memory.

Is context engineering the same as memory engineering? Partly, and the framing is now mainstream — Gartner called 2026 "the Year of Context." Context engineering manages the working-memory tier (the finite window); persistent memory is the long-term tier that feeds it. The academic taxonomy still separates them, so treat the equation as an argued position, not a settled fact.


Edward Izgorodin, June 2026 — LinkedIn