The Context Engineering Manifesto
Context engineering has an infrastructure problem and a craft problem. The industry is obsessed with the first and ignoring the second. This essay is about the second: the craft of human-to-machine communication.
By Mariano Morera Sánchez · Founder, Atomic Products · ~3,400 words
Here is a scene I have watched play out dozens of times, across different teams, with minor variations.
A senior AI engineer opens a 600-word system prompt from a Python file, pastes it into Claude's web UI, manually deletes the Markdown headers she had been using for GPT-4, rewrites them as XML tags, re-pastes, notices she forgot to adjust the constraint positioning the way Claude prefers, deletes the block, starts over.
The prompt was not the problem. The prompt was excellent — carefully authored, tested, versioned in git, reviewed by a teammate. What broke was the craft around it: the discipline of taking a structured thought and shaping it into something three different LLMs would each interpret correctly. That discipline had no tool, no language, no structure that matched the complexity of the work. So she improvised. Not because she lacked skill — because the skill she was exercising had no professional surface to work on.
Everyone is talking about context engineering. Almost nobody is talking about the part where a human sits down and writes the thing.
01
The term we are all using now
First, credit where it is due. The phrase "context engineering" was popularised in mid-2025 by Andrej Karpathy, Tobi Lütke, Walden Yan, Ankur Goyal, Dex Horthy, and a handful of others who were trying to name something the existing vocabulary could not capture. Karpathy called it "the delicate art and science of filling the context window with just the right information for the next step." Simon Willison noted that unlike "prompt engineering" (which many people read as a pretentious term for typing things into a chatbot), "context engineering" had an inferred definition much closer to the actual work involved. Gartner formalised it as an enterprise discipline. By early 2026 the consensus among data and AI leaders had shifted: prompt engineering alone was no longer sufficient to power AI at scale.
Context engineering is the discipline of designing the full informational environment an LLM receives. Not just the prompt, but retrieved knowledge, tool outputs, conversation history, memory state, system instructions, output schemas — everything the model sees before it generates the next token.
Context engineering matters because prompt engineering assumed a static world. You wrote a prompt, tested it, maybe iterated a few times, and shipped it. The problems showed up when LLM applications needed to call external tools, retrieve documents dynamically, maintain state across dozens of turns, or coordinate multiple agents. In those scenarios the prompt is just one piece of what the model sees. You can have a perfect prompt and still get terrible results if the surrounding context is wrong. The reframe is correct. I am not here to argue with the reframe.
I am here to argue that the reframe has been collapsed into something smaller than its own definition says it is.
02
The authoring gap
Here is the question I cannot find in any of the published pieces on context engineering, and I have read most of them by now:
When an LLM reads its context window, a large portion of what it sees — the role definition, the tone register, the domain knowledge, the task framing, the reasoning strategy, the examples, the output format, the constraints, the tool descriptions — was written by a human. Not retrieved. Not orchestrated. Authored. Where did that authored content come from? What tool? What structure? Who reviewed it? What does the craft of writing it look like, for a team, at scale, over time?
The honest answer, in 2026, is: a Google Doc, a comment at the top of a Python file, a prompt buried in a LangChain template, a Slack thread someone promised to document later. Or, increasingly, a "prompt management platform" that is — let us be honest with each other — a version-controlled text area with logging bolted on the back.
Notice the asymmetry:

- The runtime half: vector databases, retrieval pipelines, agent frameworks, tool orchestration, evaluation harnesses, observability stacks, token budgeting libraries, context compression techniques. It has real infrastructure.
- The authoring half: a rectangular text box that looks exactly like the one you had in 2022. No IDE. No structure. No compilation. No quality metrics. It has no equivalent.
Runtime infrastructure cannot fix authoring failures. If the role is ambiguous, no amount of retrieval fixes it. If the output format is hand-waved, no amount of evaluation catches the drift before production. Authoring is upstream of everything. Bad authoring poisons the well that runtime tries to draw from.
So why has it been ignored? My honest read: because authoring feels subjective. It feels like writing. It feels like prose. It does not photograph well on a dashboard. Infrastructure teams ship infrastructure because there is a KPI attached; nobody's job title reads "quality of the authored input." So the work gets delegated to whoever has the patience, in whatever tool happens to be open, under whatever process the individual happened to invent that week.
Prompts are code. They deserve the same discipline: structure, version control, reusability, and tools that compile cleanly to where they need to run.
03
Authored context has a structure
The industry-standard answer to "what are the components of a prompt" has been stable for a couple of years. Google, OpenAI, Anthropic, and DAIR.AI converge on something like six pieces: a role, context, a task, examples, an output format, and constraints. Every serious prompt engineering framework teaches some version of this list.
The six-component model works well for individual prompts. For production systems, it has three honest weaknesses.
Tone is buried inside Role
A senior support engineer and a peer-support community volunteer can perform the same task with the same constraints — and yet the two system prompts read very differently because the persona and the voice register are different. Tone deserves its own slot.
Reasoning is missing
Chain-of-thought, tree-of-thought, ReAct, step-back prompting, self-consistency — these are reasoning strategies that materially change output quality. The task and the reasoning strategy can be independently modified. They deserve independent fields.
Tools are missing
Every production agent has tool definitions — function schemas, API descriptions, parameter lists, return types. These are a distinct architectural component that sits inside the authored portion of the context.
If you make those three moves (separate tone from role, promote reasoning to a first-class concern, add tools), you get a nine-field prompt anatomy: Role, Tone, Context, Task, Reasoning, Examples, Output Format, Constraints, Tools.
You do not have to use all nine at once. Start with three — Role, Context, Task — the irreducible core. Graduate to five when you need tone control and output structure. Use all nine when the work demands it. That progressive disclosure — three tiers of complexity on the same underlying anatomy — is how you make structured prompting accessible without dumbing it down.
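For concreteness, here is one way that anatomy and its three tiers could be expressed as a data structure. This is a minimal sketch in Python, assuming nothing beyond the field list above; the class name PromptAnatomy and the fields_in_use helper are hypothetical illustrations, not any tool's API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PromptAnatomy:
    """The nine-field anatomy. Tier-1 fields are required;
    the rest are added only when the work demands them."""
    # Tier 1: the irreducible core
    role: str
    context: str
    task: str
    # Tier 2: tone control and output structure
    tone: Optional[str] = None
    output_format: Optional[str] = None
    # Tier 3: the full anatomy
    reasoning: Optional[str] = None
    examples: Optional[str] = None
    constraints: Optional[str] = None
    tools: Optional[str] = None

    def fields_in_use(self) -> dict[str, str]:
        """Return only the populated fields, in authoring order."""
        order = ["role", "tone", "context", "task", "reasoning",
                 "examples", "output_format", "constraints", "tools"]
        return {name: getattr(self, name) for name in order
                if getattr(self, name) is not None}
```

A three-field prompt is just an instance with the tier-2 and tier-3 fields left empty; graduating a prompt to a higher tier means populating more fields, not rewriting it.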
04
Same content, different encoding
Claude reads XML tags natively; each field gets wrapped in its own element. GPT-4 prefers Markdown headers with concise formatting. Gemini responds best to uppercase labels with critical restrictions positioned clearly. Lovable and other dev-agent targets want natural prose — no tags, no headers, just paragraphs that read like a product owner writing to an engineering team.
The content of the Role field does not change between these targets. The encoding does. And yet the dominant workflow in 2026 is: write the prompt once, manually reformat it for each target, hope you remember all the platform-specific rules, lose information in each pass, and accept that the multi-platform version of your prompt will drift from the original within two weeks.
- Claude: XML tags (<role>...</role>)
- GPT-4: Markdown headers (## Role ...)
- Gemini: uppercase labels (ROLE: ...)
- Lovable: natural prose ("You are a...")
If the content is stable and the encoding is platform-specific, the encoding is a compilation step. You do not hand-write assembly for every CPU architecture. You write source code, and a compiler emits the platform-specific output.
Authored context should be written once, in a canonical structured form, and compiled for each target platform at the point of use.
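Here is a minimal sketch of what that compile step could look like, assuming the four encodings described above. The compile_prompt function is hypothetical, an illustration of the idea rather than any shipping tool's API.

```python
def compile_prompt(fields: dict[str, str], target: str) -> str:
    """Compile canonical prompt fields into a platform-specific encoding.
    `fields` maps field names (e.g. "role") to their authored content."""
    if target == "claude":
        # XML tags: each field wrapped in its own element
        return "\n".join(f"<{name}>\n{body}\n</{name}>"
                         for name, body in fields.items())
    if target == "gpt-4":
        # Markdown headers, one section per field
        return "\n\n".join(f"## {name.replace('_', ' ').title()}\n{body}"
                           for name, body in fields.items())
    if target == "gemini":
        # Uppercase labels
        return "\n\n".join(f"{name.replace('_', ' ').upper()}:\n{body}"
                           for name, body in fields.items())
    if target == "lovable":
        # Natural prose: no markup, just the authored paragraphs
        return "\n\n".join(fields.values())
    raise ValueError(f"unknown target: {target}")
```

The point of the sketch is the shape, not the specifics: the authored content appears exactly once, and every target-specific rule lives in the compiler instead of in someone's memory.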
05
What this looks like in practice
Imagine a customer-support triage agent that classifies incoming tickets, drafts internal summaries, and flags anything urgent for human review. Here is how that looks authored with the nine fields:
customer-support-triage.prompt — canonical form
- Role: Senior support engineer at a B2B SaaS company, specialised in triage for a 10,000-customer product
- Tone: Concise, analytical, no customer-facing language; this is an internal classifier
- Context: Product description, customer's plan, ticket history, urgency taxonomy (critical / high / medium / low)
- Task: Classify urgency and department, draft a one-paragraph internal summary, flag for human review if critical
- Reasoning: Assess functional impact → match to urgency → identify department → decide on flag
- Examples: Login failure during a time-sensitive demo → critical, technical. Dark-mode feature request → low, general
- Output Format: Structured JSON with urgency, department, reasoning, summary, flagged
- Constraints: Never promise refunds; never mention internal tooling; urgency based on functional impact, not emotional tone
- Tools: lookup_customer_history(), lookup_similar_tickets(), escalate_to_human()
That is one canonical authored artifact. Now compile it. For Claude, each field becomes an XML element. For GPT-4, a Markdown section. For Gemini, an uppercase label. Three different outputs, one source.
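Reusing the hypothetical PromptAnatomy and compile_prompt sketches from earlier, compiling the three-field core of this prompt might look like the following, with output abbreviated in the comments.

```python
triage = PromptAnatomy(
    role="Senior support engineer at a B2B SaaS company, "
         "specialised in triage for a 10,000-customer product",
    context="Product description, customer's plan, ticket history, "
            "urgency taxonomy (critical / high / medium / low)",
    task="Classify urgency and department, draft a one-paragraph "
         "internal summary, flag for human review if critical",
)

print(compile_prompt(triage.fields_in_use(), "claude"))
# <role>
# Senior support engineer at a B2B SaaS company, specialised in ...
# </role>
# <context>
# ...

print(compile_prompt(triage.fields_in_use(), "gemini"))
# ROLE:
# Senior support engineer at a B2B SaaS company, specialised in ...
# ...
```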
Notice what is not in this picture. We have not talked about retrieval, vector stores, token budgets, or memory strategies. Those are real and necessary — and the topic of a different conversation. This is the authored half. Good authoring does not replace good retrieval. It makes good retrieval effective by giving the model a clear frame to interpret retrieved information inside.
06
Why this matters now
There is a window closing. Three reasons it is worth naming the authoring half of context engineering in 2026 rather than waiting.
Authoring failures compound at scale
A misformatted role field that worked in a demo breaks silently when the underlying model is updated. A constraint that was obvious to the original author is ambiguous to the next engineer. Teams that have shipped fifty prompts cannot absorb this informally. The dominant bug class in production LLM apps right now is fundamentally an authoring problem wearing a runtime symptom.
Nobody is building on a single model anymore
Teams that picked Claude in 2024 are evaluating GPT-5 and Gemini 2.5 Flash. The "compile once, deploy anywhere" discipline that software got in the 1970s is being reinvented for prompts, one manually reformatted prompt at a time.
The craft gap produces a talent gap
Companies are trying to hire prompt engineers and context engineers and are mostly finding infrastructure people. The actual craft — writing a structured prompt that a human can read, a system can compile, and a model can execute reliably — has no standard tooling, no formal curriculum, and no shared vocabulary.
I might be wrong about the number nine. I might be wrong about the specific field ordering. I am not wrong about the underlying observation: authoring is engineering, and we have been treating it as writing.
07
What prompt-x is
The argument above implies a tool. Nine named fields. Compile-time platform formatting. Variable substitution. Quality scoring that checks authored content the way a linter checks code. Version control on every field. One-click deployment to the target platform.
prompt-x is a context engineering tool built for the authored half of the problem. It is the structured prompt editor that treats prompt anatomy as a first-class concern — not a text area with logging bolted on. It is not a runtime platform. It is not an observability tool. It is not a prompt marketplace. It is the IDE for prompt engineering.
The editor understands the nine fields and enforces their ordering as a cognitive flow, from who the model is through to what it can use. It offers three complexity levels: start with three fields for quick work, graduate to five for production prompts, use all nine when the task demands it. CLEAR scoring evaluates what you have written against what each field should contain. The compilation engine handles the encoding so you never hand-reformat again.
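Of those features, variable substitution is the easiest to show in miniature. Here is a generic sketch built on Python's standard-library string.Template; it illustrates the idea only and is not prompt-x's actual interface.

```python
from string import Template

# A canonical field authored once, with a placeholder bound at compile time.
# Generic illustration only; not prompt-x's actual interface.
role_field = Template(
    "Senior support engineer at a B2B SaaS company, "
    "specialised in triage for a $customer_count-customer product"
)

print(role_field.substitute(customer_count="10,000"))
# Senior support engineer at a B2B SaaS company,
# specialised in triage for a 10,000-customer product
```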
Whether you use prompt-x, build your own internal tooling, or keep your prompts in a Google Doc, the underlying observation stands: authoring deserves to be treated like engineering. If this essay convinces you of that, it has done its job.
Footnotes and sources
- Andrej Karpathy on context engineering
- Tobi Lütke on context engineering
- Dex Horthy's original context engineering diagram
- Elvis Saravia (DAIR.AI), Context Engineering Guide
- Simon Willison on 'context engineering' framing
- Gartner, Context Engineering: Why It's Replacing Prompt Engineering
- Firecrawl, Context Engineering vs Prompt Engineering for AI Agents
prompt-x is currently in beta. Join the waitlist for early access.