12-Factor Agents: The Production Playbook for Building Reliable LLM Software
The 12-Factor Agents methodology (21k+ GitHub stars) defines how to build production-grade AI agents. Here's what each factor means and which tools implement them.
Most AI agent projects die in the demo phase. They work on a recorded screen, then fall apart when real users hit them with edge cases, long-running tasks, or just slightly wrong prompts.
The 12-Factor Agents project (21,500+ GitHub stars as of May 2026) addresses this head-on. Written by Dex Horthy from HumanLayer, it's a set of principles for building LLM-powered software that's actually production-ready, inspired by the original 12-Factor App methodology that defined how we build web apps.
I've gone through all 12 factors. Here's what each one means in practice, which frameworks get them right, and where most teams trip up.
The Core Insight
The 12-Factor Agents thesis is deceptively simple: most "AI agents" in production are largely deterministic code with LLM steps sprinkled in at exactly the right points. They are not "here's a prompt, here's a bag of tools, loop until you hit the goal."
This matches what we see tracking AI tools on NeuralStackly. The teams shipping reliable agents, rather than demo reels, are the ones who treat the LLM as a component, not the entire system.
Factor 1: Natural Language to Tool Calls
The first building block: convert natural language into structured tool calls. Not complex agentic loops, just a single translation step.
"Can you create a payment link for $750?" → CreatePaymentLink(amount=750, currency="USD")
This is the foundation. Every agent framework that's worth using supports this: LangGraph (32,500+ stars), CrewAI (51,500+ stars), OpenAI Agents SDK (26,500+ stars), and Pydantic AI (17,000+ stars) all handle tool calling. The difference is how much ceremony they wrap around it.
The principle says: start here. Don't jump to multi-step agents. Get single-step tool calling working reliably first.
Where teams trip up: Building a 10-tool agent before getting one tool call right.
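Here's a minimal sketch of that single translation step. It assumes the model has already been asked to reply with a JSON tool call; `fake_llm` is a hypothetical stand-in for a real model call, and the `CreatePaymentLink` dataclass simply mirrors the example above.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class CreatePaymentLink:
    amount: int
    currency: str = "USD"


def fake_llm(user_message: str) -> str:
    """Hypothetical stand-in for a model call that returns a JSON tool call."""
    return '{"tool": "CreatePaymentLink", "amount": 750, "currency": "USD"}'


def to_tool_call(user_message: str) -> CreatePaymentLink:
    """One translation step: natural language in, one typed tool call out."""
    raw = json.loads(fake_llm(user_message))
    if raw.get("tool") != "CreatePaymentLink":
        raise ValueError(f"unexpected tool: {raw.get('tool')!r}")
    return CreatePaymentLink(amount=int(raw["amount"]), currency=raw.get("currency", "USD"))


call = to_tool_call("Can you create a payment link for $750?")
print(call)
```

Once this one step is reliable, including its failure cases, adding a second tool is a small change rather than a leap.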
Factor 2: Own Your Prompts
Don't outsource your prompt engineering to a framework. Many frameworks provide a "black box" approach where you configure an agent with role, goal, personality, and tools, and the framework handles the rest.
The problem: you can't debug what you can't see. When an agent hallucinates in production, you need to know exactly what prompt produced the bad output.
This means:
- Your prompts should live in your codebase, not in framework internals
- You should be able to version them alongside your application code
- You should be able to A/B test them without framework approval
Tools that respect this: Mastra (24,000+ stars) keeps prompts as first-class objects in your TypeScript code. Pydantic AI makes prompts explicit Python functions. The MCP ecosystem, with servers like the Playwright MCP (32,500+ stars), forces you to define tool schemas explicitly, which is a form of prompt ownership.
Where teams trip up: Using Agent(role="...", goal="...") abstractions and never looking at what actually gets sent to the LLM.
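One way to keep a prompt in your own codebase is to make it an ordinary function that returns the exact string sent to the model. This is a sketch, not any framework's API: `build_triage_prompt`, `PROMPT_VERSION`, and the ticket-triage task are all hypothetical.

```python
PROMPT_VERSION = "support-triage-v2"  # hypothetical label, bumped alongside code changes


def build_triage_prompt(ticket_text: str, allowed_labels: list[str]) -> str:
    """Return the exact text sent to the model, so it can be logged and diffed."""
    labels = ", ".join(allowed_labels)
    return (
        "You are triaging a support ticket.\n"
        f"Choose exactly one label from: {labels}.\n"
        "Reply with the label only.\n\n"
        f"Ticket:\n{ticket_text.strip()}"
    )


prompt = build_triage_prompt("  I can't log in.  ", ["billing", "auth", "other"])
print(f"[{PROMPT_VERSION}]\n{prompt}")
```

Because the prompt is plain code, it shows up in code review, and an A/B test is just two functions behind a flag.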
Factor 3: Own Your Context Window
This is the factor that launched a thousand "context engineering" posts. The principle: everything is context engineering. LLMs are stateless functions. Every call is: "here's what happened so far, what's the next step?"
You don't have to use standard message arrays. You can format context however you want: structured JSON, condensed summaries, retrieved documents, whatever serves your use case. The key is you own it.
Context engineering in practice:
- Trim aggressively: Don't send 50KB of chat history when 2KB of summary works
- Fetch just-in-time: Use RAG to pull relevant context at query time, not upfront
- Compress state: After each step, reduce the accumulated context to what matters for the next step
Where teams trip up: Sending the entire conversation history to every LLM call and wondering why costs explode and quality degrades.
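A rough sketch of the trimming idea: keep a running summary plus only the most recent events that fit a budget. Character count is used here as a crude proxy for tokens, and `build_context` and its parameters are hypothetical names.

```python
def build_context(summary: str, events: list[str], budget_chars: int = 2000) -> str:
    """Keep the summary plus the most recent events that fit in the budget."""
    kept: list[str] = []
    used = len(summary)
    for event in reversed(events):      # walk from newest to oldest
        cost = len(event) + 1           # +1 for the joining newline
        if used + cost > budget_chars:
            break
        kept.append(event)
        used += cost
    # Restore chronological order before joining.
    return "\n".join([summary, *reversed(kept)])


history = [f"step {i}: " + "x" * 40 for i in range(100)]
context = build_context("Summary: user wants a refund for order 1182.", history, budget_chars=300)
print(context)
```

A production version would likely count tokens with the model's tokenizer and refresh the summary as older events fall out of the window.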
Factor 4: Tools Are Just Structured Outputs
Tools don't need to be complex. At their core, a tool call is just the LLM outputting JSON that your code parses into a function call.
# The LLM outputs:
{"tool": "SearchIssues", "query": "login bug", "status": "open"}
# Your code parses and executes it deterministically
This means you don't need a fancy tool framework. You need a reliable JSON parser and well-defined schemas. This is why structured output libraries (13,000+ stars for the Instructor library) are so popular: they make the LLM-to-JSON pipeline reliable.
The MCP protocol embodies this principle: tools are defined as JSON schemas, and the protocol handles the structured output parsing. It's tool calling reduced to its essence.
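To make the "parse and execute deterministically" step concrete, here is a small dispatcher built on nothing but the standard library. `search_issues` and the `TOOLS` registry are hypothetical; a real tool would call your issue tracker.

```python
import json


def search_issues(query: str, status: str = "open") -> str:
    """Hypothetical tool implementation; a real one would call an issue tracker."""
    return f"searched {status} issues for {query!r}"


TOOLS = {"SearchIssues": search_issues}


def dispatch(llm_output: str) -> str:
    """Parse the model's JSON and run the matching function deterministically."""
    payload = json.loads(llm_output)
    name = payload.pop("tool", None)
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name!r}")
    return TOOLS[name](**payload)


result = dispatch('{"tool": "SearchIssues", "query": "login bug", "status": "open"}')
print(result)
```

Unknown tool names and malformed JSON both raise here, which gives your own code, not the model, the final say on what runs.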
Factor 5: Unify Execution State and Business State
Many systems try to separate "where are we in the workflow?" (execution state) from "what data have we collected?" (business state). For AI agents, this separation often creates more complexity than it saves.
Instead, keep one unified state object. Your agent's current step, collected data, retry counts, and waiting status all live together. This makes debugging easier, because you can snapshot the entire state at any point.
This is how LangGraph works: its state graph is a single object that flows through nodes. Each node reads and writes to the same state.
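Independent of any framework, a unified state can be as plain as one dataclass. The sketch below is an assumption about what such an object might hold (the field names are hypothetical), and it shows the snapshot-and-restore property the factor is after.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class AgentState:
    step: str = "collect_details"                   # execution state
    retries: int = 0                                # execution state
    waiting_on: Optional[str] = None                # execution state
    collected: dict = field(default_factory=dict)   # business state


state = AgentState()
state.collected["email"] = "test@example.com"
state.step = "await_approval"
state.waiting_on = "human"

# One JSON document captures everything needed to inspect or restore the run.
snapshot = json.dumps(asdict(state), sort_keys=True)
restored = AgentState(**json.loads(snapshot))
print(snapshot)
```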
Factor 6: Launch, Pause, Resume
Real agents don't execute synchronously in a single request. They need to:
- Launch a long-running task
- Pause while waiting for human approval, an external API, or a training pipeline
- Resume when the response comes back
This is where most agent demos break down. They run everything in-memory during a single HTTP request. Production agents need persistence: save state, release resources, and pick up where you left off.
Implementation patterns:
- Queue-based: agent writes state to a database, a worker picks it up later
- Webhook-based: agent pauses and registers a callback URL
- Event-driven: agent subscribes to events and reacts
See our AI DevOps guide for tools that handle long-running agent workflows.
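A sketch of the queue-based pattern, using SQLite from the standard library. The in-memory database is only there to keep the example self-contained; the table layout and the `launch` and `resume` helpers are hypothetical.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")  # a real deployment would use a durable database
db.execute("CREATE TABLE runs (run_id TEXT PRIMARY KEY, state TEXT NOT NULL)")


def save(run_id: str, state: dict) -> None:
    db.execute("INSERT OR REPLACE INTO runs VALUES (?, ?)", (run_id, json.dumps(state)))
    db.commit()


def load(run_id: str) -> dict:
    row = db.execute("SELECT state FROM runs WHERE run_id = ?", (run_id,)).fetchone()
    if row is None:
        raise KeyError(run_id)
    return json.loads(row[0])


def launch(run_id: str, task: str) -> None:
    """Start a run and pause it, persisting what it is waiting for."""
    save(run_id, {"task": task, "status": "paused", "waiting_on": "approval"})


def resume(run_id: str, response: str) -> dict:
    """Called later by a worker or webhook handler once the response arrives."""
    state = load(run_id)
    state.update(status="running", waiting_on=None, last_response=response)
    save(run_id, state)
    return state


launch("run-1", "delete stale records")
final = resume("run-1", "approved")
print(final)
```

Nothing about the run lives in process memory between `launch` and `resume`, so with a durable store the two calls can happen hours apart on different machines.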
Factor 7: Contact Humans with Tools
Human-in-the-loop isn't optional for production agents. But the way most teams implement it (sending an email and hoping someone responds) is fragile.
The principle: treat "ask a human" as just another tool call. The agent calls RequestApproval(decision="Can I delete these 500 records?") and your system handles routing the question to the right person, collecting the response, and feeding it back to the agent.
This is what HumanLayer (the company behind 12-Factor Agents) builds. It's also a pattern you can implement yourself with a simple approval queue.
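For the do-it-yourself route, a simple approval queue might look something like this. It is not HumanLayer's API; the in-memory `approval_queue` and both function names are hypothetical, and a real system would persist the queue and notify a person.

```python
import uuid

approval_queue: dict[str, dict] = {}  # stand-in for a durable queue or table


def request_approval(decision: str) -> dict:
    """Tool handler: record the question and tell the loop to stop and wait."""
    request_id = str(uuid.uuid4())
    approval_queue[request_id] = {"decision": decision, "answer": None}
    return {"status": "waiting_for_human", "request_id": request_id}


def record_answer(request_id: str, approved: bool) -> str:
    """Called when the human responds; returns the text fed back to the agent."""
    approval_queue[request_id]["answer"] = approved
    verdict = "approved" if approved else "rejected"
    return f"Human {verdict}: {approval_queue[request_id]['decision']}"


pending = request_approval("Can I delete these 500 records?")
tool_result = record_answer(pending["request_id"], approved=False)
print(tool_result)
```

From the agent's point of view, the human's answer arrives as an ordinary tool result, so no special-case code path is needed.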
Factor 8: Own Your Control Flow
This is the big one. Don't let a framework decide how your agent loops. Own the control flow yourself.
Most agent frameworks provide a default loop: "call LLM, check for tool calls, execute tools, call LLM again, repeat." This works for demos. It doesn't work for production, where you need:
- Break conditions: Stop the loop when a specific tool is called (e.g., RequestApproval)
- Custom routing: After tool X, skip the LLM and go straight to tool Y
- Fallback paths: If the LLM fails, execute a deterministic fallback
- Parallel execution: Run independent tool calls simultaneously
We covered this in depth in our post on why agents need control flow, not more prompts. The 12-Factor approach agrees: your control flow should be explicit, debuggable, and yours.
Frameworks that support this: LangGraph gives you a graph with explicit edges. Mastra uses workflow definitions. The OpenAI Agents SDK has handoff primitives. All of them let you define control flow rather than accepting a default loop.
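Here's what an explicitly owned loop can look like in plain Python, with a break condition and a fallback path. The model is replaced by a scripted iterator so the example runs on its own; `run_agent`, the `lookup` tool, and the step format are all hypothetical.

```python
from typing import Callable


def run_agent(next_step: Callable[[list], dict], tools: dict, max_steps: int = 5) -> list:
    """Explicit loop: every exit and branch is visible in this function."""
    history: list = []
    for _ in range(max_steps):
        try:
            step = next_step(history)
        except Exception:
            # Fallback path: the model call failed, so take a deterministic action.
            history.append({"tool": "fallback", "result": "escalated to support queue"})
            break
        if step["tool"] == "RequestApproval":
            # Break condition: stop looping and wait for a human.
            history.append({"tool": "RequestApproval", "result": "paused"})
            break
        if step["tool"] == "done":
            break
        result = tools[step["tool"]](**step.get("args", {}))
        history.append({"tool": step["tool"], "result": result})
    return history


scripted = iter([
    {"tool": "lookup", "args": {"user_id": 7}},
    {"tool": "RequestApproval"},
    {"tool": "lookup", "args": {"user_id": 8}},  # never reached
])
history = run_agent(lambda h: next(scripted), {"lookup": lambda user_id: f"user {user_id} found"})
print(history)
```

The `max_steps` cap is its own kind of break condition: the loop cannot run away even if the model never asks to stop.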
Factor 9: Compact Errors into the Context Window
When a tool fails, don't dump a 200-line stack trace into the context window. The LLM doesn't need your Python traceback. It needs a compact, actionable error message.
Bad:
Traceback (most recent call last):
File "/app/tools/database.py", line 142, in execute_query
cursor.execute(query)
psycopg2.errors.UniqueViolation: duplicate key value violates unique constraint "users_email_key"
DETAIL: Key (email)=(test@example.com) already exists.
... (50 more lines)
Good:
Tool Error: Database insert failed: a user with email "test@example.com" already exists.
Suggestion: Use update instead of insert, or check for existing record first.
This saves tokens, saves money, and actually helps the LLM recover. See our LLM observability guide for more on debugging agent failures.
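A small helper can do the compaction before anything reaches the model. This sketch keeps only the exception type and the first line of its message, plus an optional hint; `compact_error` is a hypothetical name, and `ValueError` stands in for the database driver's exception.

```python
def compact_error(exc: Exception, hint: str = "", max_chars: int = 200) -> str:
    """Reduce an exception to its type and first message line, plus an optional hint."""
    lines = str(exc).strip().splitlines()
    first_line = lines[0] if lines else "no message"
    message = f"Tool Error: {type(exc).__name__}: {first_line}"[:max_chars]
    return f"{message}\nSuggestion: {hint}" if hint else message


try:
    # Stand-in for a driver error carrying a multi-line message.
    raise ValueError(
        'duplicate key value violates unique constraint "users_email_key"\n'
        "DETAIL: Key (email)=(test@example.com) already exists."
    )
except ValueError as exc:
    feedback = compact_error(
        exc, hint="Use update instead of insert, or check for an existing record first."
    )

print(feedback)
```

The full traceback still belongs in your logs; only the compact version goes into the context window.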
Factor 10: Small, Focused Agents
Rather than building one monolithic agent that does everything, build small agents that do one thing well. Each agent handles a single domain or task type. A router or orchestrator dispatches to the right specialist.
This works because:
- Smaller context windows mean lower costs and higher accuracy
- Focused prompts are easier to debug and optimize
- Independent agents can be deployed, scaled, and updated separately
The OpenAI Agents SDK and CrewAI both support multi-agent patterns. Mastra has built-in agent collaboration primitives. See our agent frameworks comparison for the full landscape.
Where teams trip up: Building a "do everything" agent with 30 tools instead of three agents with 5 tools each.
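The router itself can be very small. In this sketch each "agent" is just a function and routing is done by keyword, which is a deliberate simplification; the agent names and keyword lists are hypothetical, and a small classifier model could fill the same role.

```python
def billing_agent(message: str) -> str:
    return f"[billing] handling: {message}"


def auth_agent(message: str) -> str:
    return f"[auth] handling: {message}"


def general_agent(message: str) -> str:
    return f"[general] handling: {message}"


# Each specialist owns a narrow domain; the router only decides who gets the message.
ROUTES = [
    (("invoice", "refund", "charge"), billing_agent),
    (("password", "login", "2fa"), auth_agent),
]


def route(message: str) -> str:
    """Dispatch to the first specialist whose keywords match, else the generalist."""
    lowered = message.lower()
    for keywords, agent in ROUTES:
        if any(word in lowered for word in keywords):
            return agent(message)
    return general_agent(message)


print(route("I need a refund for last month's invoice"))
print(route("My login stopped working"))
```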
Factor 11: Trigger from Anywhere
Agents shouldn't only be triggered by user messages in a chat interface. They should respond to:
- Webhooks from GitHub, Slack, Stripe
- Scheduled cron jobs
- Database changes
- File uploads
- API calls from other services
This is why MCP servers are important: they standardize how agents connect to external systems. The Playwright MCP (32,500+ stars) lets agents browse the web. The Figma Context MCP (14,500+ stars) gives agents access to design files.
Build your agents as event processors, not chatbots.
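One way to read "event processors, not chatbots": put a thin adapter in front of each trigger and give the agent a single entry point. The `Event` shape and the adapter functions below are hypothetical, not a standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    source: str   # e.g. "webhook", "cron", "chat"
    kind: str
    payload: dict


def from_webhook(provider: str, body: dict) -> Event:
    """Adapter for webhook deliveries; the body layout is an assumption."""
    return Event(source="webhook", kind=f"{provider}.{body.get('type', 'unknown')}", payload=body)


def from_cron(job_name: str) -> Event:
    """Adapter for scheduled jobs."""
    return Event(source="cron", kind=f"schedule.{job_name}", payload={})


def handle(event: Event) -> str:
    """Single entry point: the agent never needs to know which adapter produced the event."""
    return f"processing {event.kind} from {event.source}"


print(handle(from_webhook("stripe", {"type": "invoice.paid", "amount": 750})))
print(handle(from_cron("nightly_report")))
```

Adding a new trigger then means writing one more adapter, with no change to the agent itself.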
Factor 12: Make Your Agent a Stateless Reducer
The final principle: think of your agent as a foldl (reduce) operation. Given the current state and a new event, produce the next state.
(state, event) → new_state
No hidden state. No global variables. No side effects that aren't captured in the state object. This makes agents:
- Testable: You can test any step by providing a state and event
- Replayable: Given the initial state and sequence of events, you get the same result
- Debuggable: You can inspect the state at any point in the execution
This is the hardest principle to implement, but the one that pays off most in production. See our testing and QA tools for frameworks that support stateless agent testing.
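The reducer shape maps directly onto `functools.reduce`. In this sketch the event types and state fields are hypothetical; the point is that `step` returns a new state and never mutates its inputs, so replaying the same events gives the same result.

```python
from functools import reduce


def step(state: dict, event: dict) -> dict:
    """Pure transition: returns a new state and never mutates its inputs."""
    if event["type"] == "tool_result":
        return {**state, "results": [*state["results"], event["value"]]}
    if event["type"] == "approval":
        return {**state, "status": "approved" if event["ok"] else "rejected"}
    return state  # unknown events leave the state unchanged


initial = {"status": "running", "results": []}
events = [
    {"type": "tool_result", "value": "found 500 stale records"},
    {"type": "approval", "ok": True},
]

final_state = reduce(step, events, initial)
replayed = reduce(step, events, initial)  # same inputs, same output
print(final_state)
```

Testing a single step is then a matter of calling `step` with a hand-written state and event, with no model or network involved.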
Which Frameworks Best Match the 12 Factors?
No framework implements all 12 perfectly. But some come closer than others:
| Framework | Stars | Best at | Weak on |
|---|---|---|---|
| LangGraph | 32,500+ | Control flow (F8), unified state (F5) | Prompt ownership (F2): prompts are wrapped in framework abstractions |
| OpenAI Agents SDK | 26,500+ | Tool calling (F1, F4), multi-agent (F10), handoffs | Context management (F3): less control over context formatting |
| Mastra | 24,000+ | Prompt ownership (F2), TypeScript-native, MCP integration | Maturity: newer ecosystem |
| Pydantic AI | 17,000+ | Structured outputs (F4), type safety, Python-native | Long-running execution (F6): mostly request/response |
| CrewAI | 51,500+ | Multi-agent (F10), role-based task routing | Control flow (F8): hidden behind crew abstractions |
The 12-Factor approach suggests: don't pick one framework and commit. Use the right pattern for each factor, and build the glue yourself.
What This Means for Your Stack
The practical takeaway from 12-Factor Agents isn't "use this framework" or "avoid that pattern." It's a mindset shift:
1. Start simple. Get single-step tool calling working before building multi-step agents.
2. Own the critical path. Your prompts, your context, your control flow. Frameworks can help, but they shouldn't own these.
3. Design for failure. Agents will fail. Compact errors, persist state, and make resumption automatic.
4. Keep agents small. A 5-tool agent with great prompts beats a 30-tool agent with mediocre ones.
5. Test statelessly. If you can't replay an agent's execution from a saved state, you can't debug it in production.
If you're building agents for production, compare your stack against these 12 factors. The gaps you find are probably where your bugs live.
Explore the full agent tooling landscape: AI Agent Frameworks · Coding Agents · MCP Tools · Agent Observability · Testing & QA · AI DevOps · Open Source AI · Model Benchmarks