Skip to content

Context Engineering Needs a Compiler

Most teams now agree on the big label. The discipline is context engineering.

The harder question is smaller and more practical: what do you call the runtime component that actually builds the context window for each model call?

That job is not trivial. It must choose among retrieved passages, memory, system instructions, tool state, conversation history, and often untrusted external content. It must fit a hard token budget. It must decide what stays, what gets dropped, and why.

That component still lacks a clean, shared name. Different frameworks expose parts of it. Different vendors brand around it. But the responsibility itself is real, and it is increasingly visible in production practice.

Context engineering is the discipline of getting the right tokens into the model context window for a given task or turn.

A context compiler is the per-turn runtime component that plans, fetches, gates, ranks, deduplicates, budgets, sanitizes, and assembles one context window for one model call.

The case for this name is not that the layer is newly discovered. It is not. The field already says the work exists between retrieval and orchestration. The case is that the field still conflates three different things, and that the per-turn assembler deserves a precise spec.

TL;DR

  • Context engineering is now the umbrella term, with practitioner support from Anthropic and the Karpathy/Lütke usage, and academic formalization in the survey by Mei et al..
  • "Context" names three layers of one system: the discipline, the stored context an agent draws from, and the per-turn runtime assembler. This article separates them.
  • The per-turn assembly job sits between RAG and agent orchestration, owned cleanly by neither. Roadie and Redis describe that gap from different angles.
  • The name context compiler has credible lineage: DSPy at ICLR 2024 framed LLM systems in compile terms, and Google ADK in Dec 2025 described context as "a compiled view" built through observable passes.
  • No major framework ships this as one unified, named layer. LlamaIndex has many parts, LangGraph leaves budgeting largely to the developer, and MemGPT/Letta leans on LLM-directed assembly.

Context engineering, stored context, and the context compiler

The word "context" gets used for three layers of the same system. They relate — the discipline shapes the data, and the data feeds the runtime — but they are not the same thing, and separating them is most of the work.

SenseWhat it isAltitude
Context engineeringThe discipline and craft of getting the right tokens into the window.Strategy, design-time.
Stored contextThe raw context an agent can draw from, in any form — documents, a database, a knowledge graph, a repo, history.Data, at rest.
Context compilerThe per-turn runtime assembly of one window for one call.Runtime, per turn.

1) Context engineering is the discipline

This is the umbrella term. It covers the strategy and craft of deciding what the model should see and in what form.

That framing is now mainstream. Anthropic uses the term directly in "Effective context engineering for AI agents". The survey by Mei et al. formalizes it at research scale and reviews 1,400+ papers. Their taxonomy spans context retrieval and generation, processing, and management. In that framing, RAG is one implementation under the umbrella, not the umbrella itself.

That matters because many production failures are not retrieval failures. They are assembly failures. The right facts may exist, but the model never sees the right subset in the right order under the right budget.

2) Stored context is the data the agent draws from

A second meaning is context as stored state: the raw material an agent can pull from, in whatever form it takes — a pile of documents, a database, a knowledge graph, a repo, conversation history. This is context at rest. It is what the compiler draws from, not the finished window.

This layer is mostly well-named already — memory systems, vector stores, retrieval. It matters, and it feeds the runtime. But it is not the per-turn assembler.

A stored corpus is an input source. It does not decide what enters one specific model window on one specific turn. That decision is a separate, runtime job — which is the third sense.

3) The context compiler is the runtime assembly component

The third sense is the one that still gets treated piecemeal.

This component takes all candidate inputs for a single call and turns them into one bounded, execution-ready context window. Its job is operational, not conceptual. It runs every turn, under time and token limits, with concrete inclusion and exclusion decisions.

This is why "compiler" is the right frame. The output is not the knowledge base. It is not the workflow graph. It is a compiled view for one call.

Google's Agent Development Kit post from Dec 4, 2025 makes that framing explicit: "Context is a compiled view over a richer stateful system." The same post describes flows and processors as a compiler pipeline and stresses that the compilation step should be observable and testable.

That language did not appear from nowhere. DSPy, published at ICLR 2024 as "DSPy: Compiling Declarative Language Model Calls into State-of-the-Art Pipelines," established compilation as a serious framing for LLM systems. DSPy compiles the program. ADK extends the framing to the context view. "Context compiler" is the natural name for the runtime assembly component that follows from that lineage — though two anchors are a slim base, so treat the name as a proposal grounded in a real pattern, not a settled standard.

Why context assembly falls between RAG and orchestration

The reason this component remains under-specified is structural. It sits between categories that already have names, owned cleanly by neither.

RAG owns retrieval. Agent frameworks own workflow and tool execution. But the final per-turn assembly job is broader than retrieval and narrower than orchestration.

David Tuite at Roadie makes a related argument: RAG is only one of several context slots, and treating it as the whole system is what drives production failures. That matches what many teams see in practice. A pipeline can retrieve documents correctly and still fail because chat history was trimmed badly, tool output was left unframed, conflicting memories were merged without resolution, or the final window exceeded budget and got truncated in the wrong place.

Redis's context orchestration post describes the runtime layer as the process that builds the window for each LLM call and "ranks, trims, and merges everything into a token-budgeted bundle." Redis also separates context orchestration from LLM orchestration. That distinction is useful. It marks the runtime assembly layer as a separate responsibility.

This article uses compiler rather than orchestration for one reason: orchestration suggests workflow timing and control flow. A context compiler is a deterministic, observable transform pipeline with passes, constraints, and inspectable output. For the per-turn assembly problem, that is the sharper model.

The two are not rivals. An orchestrator runs the steps; the compiler builds the window each step needs, and the orchestrator calls it. What the compiler reclaims is only the assembly work that frameworks leave implicit today — not the orchestration role itself.

The absence of a named, unified layer produces predictable consequences: brittle prompt templates, ad-hoc truncation logic, missing provenance, non-reproducible contexts, and security gaps. Engineers rebuild the same assembly pipeline in every serious project.

What a context compiler must do

A useful category name needs a useful component spec. These requirements describe a transform pipeline, not a black-box service. At minimum, a context compiler should do the following.

Budget context under a hard token cap

The output must fit a hard limit. The compiler cannot assume infinite context or rely on accidental truncation by downstream APIs. It must know the budget and never overshoot it.

This sounds obvious. It is still where many systems fail. "Include everything" is not a strategy.

Rank and deduplicate candidate content

Inputs compete for limited space. The compiler needs a composite basis for ranking. Relevance matters. Freshness matters. Redundancy matters.

If two fragments say the same thing, both should not survive unless the duplication itself carries signal. Deduplication is not optional once history, retrieval, memory, and tool traces all enter the same window.

Order and place the surviving fragments

Selecting what survives is only half the job. The compiler also decides where each fragment sits in the window — system instructions first, the most relevant or freshest material where the model will actually attend to it, history and lower-priority content placed deliberately. Position is not cosmetic: models attend unevenly across a long window (the "lost in the middle" effect), so the same fragments in a different order can produce a different answer. Ranking decides whether a fragment is in; placement decides where.

Preserve provenance

Every fragment should carry a reason for inclusion. The system should also record what it dropped and why.

This is the difference between an opaque bundle and an auditable one. If a context window causes a bad answer, the team should be able to inspect the assembly trace instead of guessing. Verification depends on provenance, and this is a frontier that current frameworks do not fully demonstrate.

Apply a security pass

Retrieved pages, tool output, and user-provided files are untrusted inputs. The compiler should frame or sanitize them before final assembly.

That is not just a prompt hygiene issue. It is part of context construction. A system that retrieves aggressively but compiles naively increases its attack surface.

Default to deterministic assembly

A compiler should be observable, testable, and reproducible by default.

An LLM can still help. It can summarize, compress, or propose candidates. But the default floor should be deterministic enough that a team can rerun the same turn, inspect the same passes, and explain the same output.

That is the core contrast with fully LLM-directed assembly.

Keep the assembly KV-cache friendly

A production compiler assembles with the cache in mind. Most of a turn's cost and latency comes from reprocessing the prompt prefix, so a stable, append-only prefix lets the model reuse its KV-cache instead of recomputing it. A compiler that reorders or rewrites the early window every turn quietly destroys that reuse. Placement and caching therefore interact: keep the stable parts stable, and confine churn to where it earns its cost. See KV Cache and Context Engineering.

Resolve conflicts and weight by recency — at assembly time

Two fragments can disagree. Some sources deserve more trust. Older context may be less relevant now. The compiler settles this for one window: it weights candidates by recency and source quality during ranking, and decides which of two conflicting fragments enters the window, or how both are framed.

Note the boundary. Decay and aging — letting unreinforced memories fade, consolidating the store over time — belong to the stored-context layer, not the compiler. The compiler does not forget; it reads what the store kept and decides, this turn, what survives into the window. The survey by Mei et al. treats freshness and conflict handling as likely greenfield for this runtime layer: teams know the problem exists, but public framework support is still thin.

A minimal context compiler pipeline

A practical model is a pass-based pipeline:

  1. Plan what slots are available for this turn.
  2. Fetch candidates from retrieval, memory, conversation state, tools, and policies.
  3. Gate inputs by trust, type, and policy.
  4. Rank candidates by relevance, freshness, and source quality.
  5. Deduplicate overlapping or redundant fragments.
  6. Budget-select under a hard token cap.
  7. Sanitize untrusted content and frame it clearly.
  8. Assemble the final ordered context window with provenance.

That pipeline is a category spec, and it explains why the compiler framing helps. Each stage is a pass. Each pass can be tested. Each pass can emit artifacts.

A workflow engine can call this pipeline. A retrieval system can feed it. A memory system can enrich it. None of those components replaces it.

For a related view of how working context differs from longer-lived state, see KV Cache and Context Engineering. For adjacent questions about memory evaluation, see Evaluating Agent Memory.

Context compiler examples: what frameworks do today

No major framework is empty here. The issue is not absence of features. The issue is fragmentation.

LlamaIndex: closest, but still a set of parts

LlamaIndex comes closest to a practical context compiler shape.

It has composable memory blocks. It exposes a global token_limit. It uses priority-order truncation, with priority 0 always kept. It supports pluggable rerankers and node postprocessors. It also offers packing modes such as compact, refine, and tree summarization.

Those are real compiler-like passes. But they are still presented as separate composable mechanisms, not one named, provenance-aware runtime assembly layer with a single observable contract.

That distinction matters. A toolkit can contain a parser, optimizer, and linker without yet presenting a full compiler abstraction.

LangGraph: strong workflow model, weak automatic budgeting

LangGraph makes a clean distinction between working state and persistent state. That is useful for agent design.

But for per-turn context assembly, budgeting is largely the developer's problem. Message trimming and summarization are opt-in patterns. The .compile() step validates graph structure, not final context composition.

That means the framework helps you build the workflow. It does not by itself guarantee a ranked, budgeted, per-turn context window.

MemGPT and Letta: flexible, but LLM-directed

MemGPT and Letta are important because they make memory selection an active part of agent behavior.

The tradeoff is that assembly is more LLM-directed through self-managed function calls and memory operations. That can be powerful. It is also harder to make deterministic, easier to vary across runs, and more difficult to audit at the pass level.

A context compiler can still include LLM-assisted steps. But if the whole assembly path depends on model-driven decisions, the system moves away from the "observable and testable" standard that Google ADK argues for.

The gap, stated plainly

As of mid-2026, and based on the public documentation reviewed here, no major framework ships ranked, budgeted, provenance-aware, deterministic per-turn assembly as one observable, testable component. The parts exist; the named layer does not. That gap is the article's payoff.

Why naming this layer helps

Good names do operational work.

If a team treats context assembly as "whatever happens after retrieval," it becomes nobody's contract. Bugs spread across prompts, retrievers, middleware, memory stores, and agent logic. Failures become hard to root-cause.

If the same team names a context compiler, the contract sharpens:

  • one input surface for candidate context
  • one budget owner
  • one ranking and dedup policy
  • one provenance trail
  • one security pass
  • one final artifact per call

That does not force one implementation style. It does force accountability. It also guides evaluation. Instead of asking whether a framework "handles context well," teams can ask concrete questions: does it expose a configurable compiler pipeline, can ranking strategies be swapped without rewriting orchestration code, and does provenance survive the full assembly pass?

There is a cultural effect, too. "Compiler" is a serious word. Teams treat a compiler as a real, testable artifact — passes, fixtures, reproducible output, regression tests — not as "whatever happens after retrieval." Naming the layer a compiler imports that standard: it invites the rigor the work already deserves but rarely gets. The name is not only a label; it sets the bar.

A note on the long-context reflex: as context windows expand, developers often assume that precise context management is no longer necessary. This is a mistake. Larger windows raise cost and latency and still degrade quality when relevant tokens sit among irrelevant ones. A bigger budget does not remove the need for a disciplined budget owner.

The field is already moving in this direction. Anthropic frames the broader discipline. Mei et al. formalize it. Roadie identifies the ownership gap. Redis names the runtime assembly layer. Google ADK shows the compile metaphor applied directly to context. The missing piece is not awareness that the work exists. The missing piece is a precise, shared abstraction for the per-turn assembler.

That abstraction is the context compiler.

This is early. An agent already pulls context from many places — editors, chat tools, retrieval, memory, tool output — and that list keeps growing. As it does, the per-turn assembler stops looking like a configuration step and starts to look like a renderer: it builds the agent's working view of its world, fresh each turn, under a fixed budget. Most of that surface is still unbuilt. Naming it is the first step; building it well is the work ahead.

Common questions

What is context engineering?

Context engineering is the discipline of getting the right tokens into the model context window for a given task or turn. In the current literature and practitioner usage, it is the umbrella term, and RAG is one implementation inside it, not the whole field.

What is a context compiler?

A context compiler is the per-turn runtime component that plans, fetches, gates, ranks, deduplicates, budgets, sanitizes, and assembles one context window for one model call. The term fits because the job is a deterministic, observable transform from many possible inputs into a bounded execution-ready view.

How is a context compiler different from context orchestration?

Context orchestration usually names the runtime assembly process that builds the window for each LLM call, while context engineering names the broader discipline. This article argues that compiler is the more precise name for the assembly component when the work is implemented as a testable sequence of passes.

Is RAG the same thing as context engineering?

No. The survey by Mei et al. describes context engineering as the broader umbrella, with RAG as one system implementation under it. Retrieval matters, but it does not cover ranking non-retrieval slots, budgeting, deduplication, sanitization, or final assembly.

Do major agent frameworks already ship a context compiler?

Not as one unified, named layer. LlamaIndex has many of the parts as separate stages, LangGraph leaves budgeting largely to the developer, and MemGPT-Letta relies on LLM-directed assembly that is harder to make deterministic and auditable.

Does a context compiler replace my agent framework or orchestrator?

No. They sit on different axes. An orchestrator such as LangGraph or CrewAI manages control flow over time: the sequence of steps, branches, tool calls, and state across turns. A context compiler builds the window for one model call within that flow. The orchestrator calls the compiler; it does not compete with it. What the compiler takes over is the context-assembly responsibility that frameworks handle today implicitly and unevenly, by naming it and owning it as a separate, swappable layer.


Published in the Mnemoverse Library for AI engineers working on memory, context, and agent systems.