12-Factor Agents: The Production Playbook for Building Re...

Most AI agent projects die in the demo phase. They work on a recorded screen, then fall apart when real users hit them with edge cases, long-running tasks, or just slightly wrong prompts.

The 12-Factor Agents project (21,500+ GitHub stars as of May 2026) addresses this head-on. Written by Dex Horthy from HumanLayer, it's a set of principles for building LLM-powered software that's actually production-ready — inspired by the original 12-Factor App methodology that defined how we build web apps.

I've gone through all 12 factors. Here's what each one means in practice, which frameworks get them right, and where most teams trip up.

The Core Insight

The 12-Factor Agents thesis is deceptively simple: most "AI agents" in production are mostly deterministic code with LLM steps sprinkled in at exactly the right points. They are not "here's a prompt, here's a bag of tools, loop until you hit the goal."

This matches what we see tracking AI tools on NeuralStackly. The teams shipping reliable agents — not demo reels — are the ones who treat the LLM as a component, not the entire system.

Factor 1: Natural Language to Tool Calls

The first building block: convert natural language into structured tool calls. Not complex agentic loops — just a single translation step.

"Can you create a payment link for $750?" → CreatePaymentLink(amount=750, currency="USD")

This is the foundation. Every agent framework that's worth using supports this: LangGraph (32,500+ stars), CrewAI (51,500+ stars), OpenAI Agents SDK (26,500+ stars), and Pydantic AI (17,000+ stars) all handle tool calling. The difference is how much ceremony they wrap around it.

The principle says: start here. Don't jump to multi-step agents. Get single-step tool calling working reliably first.

Where teams trip up: Building a 10-tool agent before getting one tool call right.

Factor 2: Own Your Prompts

Don't outsource your prompt engineering to a framework. Many frameworks provide a "black box" approach where you configure an agent with role, goal, personality, and tools — and the framework handles the rest.

The problem: you can't debug what you can't see. When an agent hallucinates in production, you need to know exactly what prompt produced the bad output.

This means:

•Your prompts should live in your codebase, not in framework internals
•You should be able to version them alongside your application code
•You should be able to A/B test them without framework approval

Tools that respect this: Mastra (24,000+ stars) keeps prompts as first-class objects in your TypeScript code. Pydantic AI makes prompts explicit Python functions. The MCP ecosystem — with servers like the Playwright MCP (32,500+ stars) — forces you to define tool schemas explicitly, which is a form of prompt ownership.

Where teams trip up: Using Agent(role="...", goal="...") abstractions and never looking at what actually gets sent to the LLM.

Factor 3: Own Your Context Window

This is the factor that launched a thousand "context engineering" posts. The principle: everything is context engineering. LLMs are stateless functions. Every call is: "here's what happened so far, what's the next step?"

You don't have to use standard message arrays. You can format context however you want — structured JSON, condensed summaries, retrieved documents, whatever serves your use case. The key is you own it.

Context engineering in practice:

•Trim aggressively: Don't send 50KB of chat history when 2KB of summary works
•Fetch just-in-time: Use RAG to pull relevant context at query time, not upfront
•Compress state: After each step, reduce the accumulated context to what matters for the next step

Where teams trip up: Sending the entire conversation history to every LLM call and wondering why costs explode and quality degrades.

Factor 4: Tools Are Just Structured Outputs

Tools don't need to be complex. At their core, a tool call is just the LLM outputting JSON that your code parses into a function call.

# The LLM outputs:
{"tool": "SearchIssues", "query": "login bug", "status": "open"}
# Your code parses and executes it deterministically

This means you don't need a fancy tool framework. You need a reliable JSON parser and well-defined schemas. This is why structured output libraries (13,000+ stars for the Instructor library) are so popular — they make the LLM-to-JSON pipeline reliable.

The MCP protocol embodies this principle: tools are defined as JSON schemas, and the protocol handles the structured output parsing. It's tool calling reduced to its essence.

Factor 5: Unify Execution State and Business State

Many systems try to separate "where are we in the workflow?" (execution state) from "what data have we collected?" (business state). For AI agents, this separation often creates more complexity than it saves.

Instead, keep one unified state object. Your agent's current step, collected data, retry counts, and waiting status all live together. This makes debugging easier — you can snapshot the entire state at any point.

This is how LangGraph works: its state graph is a single object that flows through nodes. Each node reads and writes to the same state.

Factor 6: Launch, Pause, Resume

Real agents don't execute synchronously in a single request. They need to:

•Launch a long-running task
•Pause while waiting for a human approval, external API, or training pipeline
•Resume when the response comes back

This is where most agent demos break down. They run everything in-memory during a single HTTP request. Production agents need persistence — save state, release resources, and pick up where you left off.

Implementation patterns:

•Queue-based: agent writes state to a database, a worker picks it up later
•Webhook-based: agent pauses and registers a callback URL
•Event-driven: agent subscribes to events and reacts

See our AI DevOps guide for tools that handle long-running agent workflows.

Factor 7: Contact Humans with Tools

Humans-in-the-loop isn't optional for production agents. But the way most teams implement it — sending an email and hoping someone responds — is fragile.

The principle: treat "ask a human" as just another tool call. The agent calls RequestApproval(decision="Can I delete these 500 records?") and your system handles routing the question to the right person, collecting the response, and feeding it back to the agent.

This is what HumanLayer (the company behind 12-Factor Agents) builds. It's also a pattern you can implement yourself with a simple approval queue.

Factor 8: Own Your Control Flow

This is the big one. Don't let a framework decide how your agent loops. Own the control flow yourself.

Most agent frameworks provide a default loop: "call LLM, check for tool calls, execute tools, call LLM again, repeat." This works for demos. It doesn't work for production, where you need:

•Break conditions: Stop the loop when a specific tool is called (e.g., RequestApproval)
•Custom routing: After tool X, skip the LLM and go straight to tool Y
•Fallback paths: If the LLM fails, execute a deterministic fallback
•Parallel execution: Run independent tool calls simultaneously

We covered this in depth in our post on why agents need control flow, not more prompts. The 12-Factor approach agrees: your control flow should be explicit, debuggable, and yours.

Frameworks that support this: LangGraph gives you a graph with explicit edges. Mastra uses workflow definitions. The OpenAI Agents SDK has handoff primitives. All of them let you define control flow rather than accepting a default loop.

Factor 9: Compact Errors into the Context Window

When a tool fails, don't dump a 200-line stack trace into the context window. The LLM doesn't need your Python traceback — it needs a compact, actionable error message.

Bad:

Traceback (most recent call last):
  File "/app/tools/database.py", line 142, in execute_query
    cursor.execute(query)
psycopg2.errors.UniqueViolation: duplicate key value violates unique constraint "users_email_key"
DETAIL: Key (email)=(test@example.com) already exists.
... (50 more lines)

Good:

Tool Error: Database insert failed — a user with email "test@example.com" already exists.
Suggestion: Use update instead of insert, or check for existing record first.

This saves tokens, saves money, and actually helps the LLM recover. See our LLM observability guide for more on debugging agent failures.

Factor 10: Small, Focused Agents

Rather than building one monolithic agent that does everything, build small agents that do one thing well. Each agent handles a single domain or task type. A router or orchestrator dispatches to the right specialist.

This works because:

•Smaller context windows mean lower costs and higher accuracy
•Focused prompts are easier to debug and optimize
•Independent agents can be deployed, scaled, and updated separately

The OpenAI Agents SDK and CrewAI both support multi-agent patterns. Mastra has built-in agent collaboration primitives. See our agent frameworks comparison for the full landscape.

Where teams trip up: Building a "do everything" agent with 30 tools instead of three agents with 5 tools each.

Factor 11: Trigger from Anywhere

Agents shouldn't only be triggered by user messages in a chat interface. They should respond to:

•Webhooks from GitHub, Slack, Stripe
•Scheduled cron jobs
•Database changes
•File uploads
•API calls from other services

This is why MCP servers are important — they standardize how agents connect to external systems. The Playwright MCP (32,500+ stars) lets agents browse the web. The Figma Context MCP (14,500+ stars) gives agents access to design files.

Build your agents as event processors, not chatbots.

Factor 12: Make Your Agent a Stateless Reducer

The final principle: think of your agent as a foldl (reduce) operation. Given the current state and a new event, produce the next state.

(state, event) → new_state

No hidden state. No global variables. No side effects that aren't captured in the state object. This makes agents:

•Testable: You can test any step by providing a state and event
•Replayable: Given the initial state and sequence of events, you get the same result
•Debuggable: You can inspect the state at any point in the execution

This is the hardest principle to implement, but the one that pays off most in production. See our testing and QA tools for frameworks that support stateless agent testing.

Which Frameworks Best Match the 12 Factors?

No framework implements all 12 perfectly. But some come closer than others:

Framework	Stars	Best at	Weak on
LangGraph	32,500+	Control flow (F5, F8), unified state (F5)	Prompt ownership (F2) — prompts are wrapped in framework abstractions
OpenAI Agents SDK	26,500+	Tool calling (F1, F4), multi-agent (F10), handoffs	Context management (F3) — less control over context formatting
Mastra	24,000+	Prompt ownership (F2), TypeScript-native, MCP integration	Maturity — newer ecosystem
Pydantic AI	17,000+	Structured outputs (F4), type safety, Python-native	Long-running execution (F6) — mostly request/response
CrewAI	51,500+	Multi-agent (F10), role-based task routing	Control flow (F8) — abstracted behind crew abstractions

The 12-Factor approach suggests: don't pick one framework and commit. Use the right pattern for each factor, and build the glue yourself.

What This Means for Your Stack

The practical takeaway from 12-Factor Agents isn't "use this framework" or "avoid that pattern." It's a mindset shift:

1. Start simple. Get single-step tool calling working before building multi-step agents.

2. Own the critical path. Your prompts, your context, your control flow. Frameworks can help, but they shouldn't own these.

3. Design for failure. Agents will fail. Compact errors, persist state, and make resumption automatic.

4. Keep agents small. A 5-tool agent with great prompts beats a 30-tool agent with mediocre ones.

5. Test statelessly. If you can't replay an agent's execution from a saved state, you can't debug it in production.

If you're building agents for production, compare your stack against these 12 factors. The gaps you find are probably where your bugs live.

Explore the full agent tooling landscape: AI Agent Frameworks · Coding Agents · MCP Tools · Agent Observability · Testing & QA · AI DevOps · Open Source AI · Model Benchmarks