Context Budgeting: How to Allocate Agent Tokens and What to Cut First
TL;DR
- Context budgeting allocates a finite attention budget across context zones before an agent overflows the window.
- Treat system instructions, tool schemas, the stable prefix, retrieval, history, tool outputs, and output buffer as separate budget lines with minimums, maximums, and eviction rules.
- Cut old tool results first, then compact dialogue, then offload durable state to external memory.
- Keep high-value context near the head or tail of its zone, because “Lost in the Middle” shows weaker retrieval when relevant content sits in the middle of long inputs.
Context budgeting is the practice of allocating a finite model context window across protected and spillable zones, then applying a defined eviction order when tokens run short.
Anthropic frames the underlying constraint as an “attention budget”: the model needs the “smallest possible set of high-signal tokens,” and “every new token depletes this budget” (Anthropic, effective context engineering for agents). Redis makes the same point in resource terms: “the context window is a rival resource,” so every token assigned to one component displaces a token from another (Redis, context window). Redis also describes context orchestration as assembling a token-budgeted bundle for the model (Redis, context orchestration).
That is the practical problem. Agents do not fail only because the window is too small. They fail because the window fills without a policy. Tool output grows. Dialogue history repeats itself. Retrieval expands until it crowds out the answer buffer. A late edit to early prompt text can also erase cache benefits.
This article is the budgeting layer under a broader optimization stage. The optimization stage chooses among competing objectives. Budgeting answers a narrower question: how many tokens does each context zone get, and what leaves first?
Context budget zones: the six claimants on agent tokens
A context budget should name the competing claimants. Redis lists the common zones as system instructions, tool schemas, history, retrieved chunks, tool outputs, and output buffer (Redis, context window). In practice, a production agent should split those into protected and spillable tiers.
| Zone | Role | Budget posture | First rule |
|---|---|---|---|
| System instructions | Behavioral contract | Protected | Keep stable and small enough to preserve downstream room |
| Tool schemas | Available actions | Mostly protected | Include only tools that can be called in this turn |
| Stable prefix | Cacheable prompt prefix | Committed | Do not reallocate turn to turn |
| Retrieved chunks | Task evidence | Bounded and ranked | Reserve a floor so retrieval is not starved |
| Dialogue history | User and agent state | Spillable after summary | Compact before blind truncation |
| Tool outputs | Observations from tools | Most spillable | Clear old bodies when tools are re-callable |
| Output buffer | Room for the answer | Protected minimum | Do not spend the response budget on input clutter |
Some costs are fixed enough to plan around. As an illustrative reference, Wire estimates that a system prompt tops out around 5K tokens regardless of window size and 15 tools at 500 tokens each add 7.5K tokens of tool-schema overhead (Wire, context budgets). Redis makes the qualitative point: as the window grows, those fixed zones shrink as a percentage, while retrieval, history, and buffer can expand (Redis, context window).
This is why “bigger window” is not the same as “no budget.” Claude context-window documentation describes 200K context and a 1M general-availability context window (Claude models overview); OpenAI's GPT-5.5 exposes a ~1M-token context window over the API (the 400K figure widely cited is the Codex CLI surface cap, not the API limit). The common approximation that 1K tokens is about 750 words still holds. Those sizes increase room, but they do not decide what deserves attention.
Token budget allocation: use min, max, protected, and spillable tiers
A usable budget is not a single number. It is a set of bounds.
For each zone, define:
- Minimum budget: the zone cannot fall below this without changing task behavior.
- Target budget: the normal allocation when pressure is low.
- Maximum budget: the zone cannot grow beyond this without explicit promotion.
- Protection tier: protected, compactable, clearable, or offloadable.
- Positioning rule: where high-value tokens sit inside the zone.
A published blog reference from Wire suggests one illustrative allocation: system 10–15%, tools 15–20%, retrieval 30–40%, history 20–30%, and buffer 10–15% (Wire, context budgets). Treat that as one reference allocation, not a standard. Vendor mechanisms tend to be clearing, compaction, trimming, and trigger thresholds, not universal percentages.
A better engineering default is to reserve floors and caps.
For example:
context_budget:
system:
min: fixed
max: fixed
tier: protected
tool_schemas:
min: selected_tools_only
max: turn_relevant_tools
tier: mostly_protected
stable_prefix:
min: committed
max: committed
tier: non_reallocatable
retrieval:
min: evidence_floor
max: ranked_cap
tier: compactable
history:
min: recent_turns_plus_summary
max: bounded
tier: compactable
tool_outputs:
min: latest_required_observations
max: small
tier: clearable
output_buffer:
min: answer_floor
max: task_dependent
tier: protectedThe exact numbers depend on model, task, latency target, and tool design. The important point is the shape. Protected zones survive pressure. Spillable zones have an eviction rule before the agent reaches the limit.
This design also matches a working-memory distinction. Slot-like systems fail at a cliff: an item is in memory or out. Resource-like systems degrade more gradually as attention spreads. The working-memory framing maps well to agents: most systems truncate like slots, but they should budget like resources, with ranked priority and protected tiers.
ContextBudget and dynamic allocation under a finite context window
The academic anchor for the term is ContextBudget, arXiv:2604.01664 (ContextBudget). It frames context management as a budget-constrained sequential decision problem.
The paper makes the state budget-conditioned through remaining budget:
r_t = B - |C_t|It also decides among three compression regimes before loading a new observation: Null, Partial, and Full compression. The constraint is explicit:
|C'_t| ≤ B - |o_t|That matters because eviction after overflow is often too late. If the next observation is large, the agent must create room before it loads the observation, not after it has already polluted the prompt.
The reported measurements are also relevant. BACM-RL improved over the MEM1 baseline by +67% F1 on 16-objective and +143% on 32-objective tasks under an 8K budget (Qwen2.5-7B; ContextBudget), with performance described as near-invariant from 16K to 4K. The takeaway is not that every system gets those gains. It is that budget-aware compression can be a first-order control surface, not a formatting detail.
For implementation, connect budgeting to the context compiler. Budget-select is one pass in an end-to-end context build. It should run before final rendering, not as a last-minute string slice.
What to cut from context first: tool results, then dialogue, then memory
Eviction order is where context budgeting becomes operational.
1. Clear old tool results first
Anthropic describes tool result clearing as “one of the safest, lightest-touch ways to recover that space” when old tool calls can be repeated (Anthropic cookbook). The cookbook’s clear_tool_uses approach replaces old tool_result bodies with a placeholder such as [cleared…] while keeping the tool_use record. That preserves the fact that the action happened, while removing bulky output that can be regenerated.
The cookbook example is concrete. With three file reads of about 40K tokens each and keep=1, context fell from 128,740 tokens to 43,060 tokens, a 67% reduction (Anthropic cookbook). In the same source, a baseline agent peaked at 335,279 tokens, with 96.3% from file-read results, 1.7% from reasoning, and the remaining ~1.9% from tool-call records. Tool output was the budget hog in that run.
That is the strongest default rule in this article: old tool results are usually the first spillable tier.
2. Compact dialogue before blind truncation
After tool outputs, compact dialogue. Anthropic describes compaction as summarizing and reinitializing the context, while keeping architectural decisions and unresolved bugs and discarding redundant tool outputs (Anthropic cookbook). The same source gives default triggers: compact at 150K; clear at 100K with keep=3.
This is not the same as cutting the oldest half of a chat transcript. Dialogue contains commitments, user preferences, constraints, and unresolved tasks. A good summary carries those forward in fewer tokens.
LangChain exposes lower-level trimming through trim_messages, including strategy values such as last and first, plus max_tokens, include_system, and allow_partial options (LangChain trim_messages). Those controls are useful, but trimming should be a policy endpoint. Prefer compaction when the content still matters.
3. Offload durable state to external memory
Long-term memory should not live permanently in the context window. Redis states the pattern plainly: move long-term memory outside the window (Redis, context window). LangChain’s context-engineering material also distinguishes writing information for later use from keeping every token in active context.
This is where a memory system earns its place. The budget decides how much room retrieval receives. The memory layer decides what fills that room.
Context window budget is also a placement problem
Budgeting is not only about amount. It is also about where tokens land.
Liu et al.’s “Lost in the Middle” paper shows a U-shaped pattern in long-context use: models perform better when relevant information appears near the beginning or end of the input, and worse when it sits in the middle (Liu et al., 2023). A bloated middle is doubly wasteful. It spends scarce budget and places content where the model may use it less effectively.
Apply that finding inside each zone:
- Put durable rules at the head of the protected prefix.
- Put the most relevant retrieved evidence at the head or tail of the retrieval block.
- Put the latest user intent near the tail.
- Avoid burying decisive constraints in the middle of a long history summary.
- Keep the output buffer unspent until generation.
This is also why indiscriminate middle truncation is risky. It may remove needed state, and it may leave the remaining evidence in weak positions.
Agent token budget and KV-cache prefix protection
A KV-cache-stable prefix is the portion of the prompt that remains byte-stable across turns so the model can reuse cached computation instead of recomputing from the start.
In context budgeting, treat the stable prefix as a committed sub-budget. Do not let retrieval growth or history overflow rewrite it turn to turn. A single-token change can invalidate cached prefixes from that point onward (Anthropic prompt caching). The time-to-first-token payoff of reusing a stable prefix is covered in our KV-cache context engineering article.
This article does not make cache mechanics the main spine; see KV-cache context engineering for that. The budgeting rule is narrower: reserve the stable prefix as non-reallocatable. If late overflow rewrites early tokens, eviction has already become expensive.
A practical eviction policy
A defensible agent budget can be written as an ordered policy:
if projected_context > budget:
clear_old_tool_results(keep_required=true)
if projected_context > budget:
compact_dialogue(
preserve=[
"user goal",
"architectural decisions",
"constraints",
"unresolved bugs",
"open tasks"
],
discard=[
"redundant tool outputs",
"repeated assistant phrasing",
"obsolete intermediate observations"
]
)
if projected_context > budget:
reduce_retrieval_to_ranked_cap(
preserve_citations=true,
place_best_evidence_at_zone_edges=true
)
if projected_context > budget:
offload_long_term_state_to_memory()
if projected_context > budget:
ask_for_scope_reduction_or_continue_in_chunks()This order protects answer quality better than one global “last N messages” policy. It also protects the stable prefix and output buffer. The agent first removes content that is bulky and reproducible, then compresses content that carries state, then narrows evidence, then moves durable memory outside the prompt.
For systems that already use orchestration, the distinction matters. Context compiler vs. orchestration separates prompt construction from runtime coordination. Budgeting belongs in the prompt-construction path, but it should receive signals from the orchestrator: active tools, current task, retrieved evidence, and remaining output requirements.
Where persistent memory fits
Mnemoverse is a persistent-memory engine for AI agents. It sits outside the budgeted context window and feeds it on demand through retrieval. Budgeting decides how much room to reserve for retrieved memory. Memory decides what evidence fills those slots.
That boundary keeps the claim modest. A memory layer does not remove the need for token budgeting. It makes budgeting more useful because retrieval gets a protected tier instead of being starved by static overhead, old tool results, or unbounded dialogue history.
See the Mnemoverse platform overview and API getting started for the product surface.
Common questions
What is context budgeting for AI agents?
Context budgeting is the practice of allocating a finite attention budget across context zones, then enforcing protected and spillable tiers so high-signal tokens survive pressure.
How should I allocate an agent token budget?
Start with explicit zones for system instructions, tool schemas, stable prefix, retrieved chunks, dialogue history, tool outputs, and output buffer. Give each zone a minimum, maximum, and eviction policy rather than relying on one global truncation rule.
What should an agent cut from context first?
Cut old tool results first, then compact dialogue, then offload long-term state to external memory. Anthropic describes tool result clearing as the safest, lightest compaction when tool calls can be repeated.
Are fixed percentage context budgets an industry standard?
No. Published percentage splits are useful references, but vendors mainly expose mechanisms such as clearing, compaction, trimming, and trigger thresholds rather than universal allocation percentages.
How does context budgeting protect KV-cache performance?
Treat the cache-stable prefix as a committed sub-budget that does not move turn to turn. Late overflow that rewrites early tokens can invalidate cached prefixes from the changed token onward.
Should I truncate or summarize old context?
Prefer compaction when the content still matters. Use truncation only with clear rules, because middle truncation can remove useful state and can place remaining content in low-attention positions.
Related
- Context Compiler — where budget-select fits into prompt construction.
- KV-Cache Context Engineering — deeper mechanics for stable-prefix reuse.
- Working Memory and AI Agents — the slots-versus-resources framing behind protected and ranked tiers.
- Context Compiler vs. Orchestration — how runtime coordination differs from context assembly.
Mnemoverse is a persistent-memory engine for AI agents: store, retrieve, and verify knowledge across sessions instead of starting cold.
