Why Your AI Agent Needs Control Flow, Not More Prompts
Prompt-driven agents break at scale. Here's how LangGraph, OpenAI Agents SDK, and Pydantic AI handle deterministic control flow — and when each approach actually works.
Last Updated: May 2026
If you've ever written `MANDATORY: DO NOT SKIP THIS STEP` in a system prompt, you've already lost. Your agent architecture is broken, and no amount of prompt engineering will fix it.
A post trending at 500+ points on Hacker News this week made the argument clearly: reliable agents tackling complex tasks need deterministic control flow encoded in software, not increasingly elaborate prompt chains. The thesis is right, but it stops short of the practical question — which frameworks actually implement this correctly, and what are the tradeoffs?
We tested three of the most popular agent frameworks — LangGraph, OpenAI Agents SDK, and Pydantic AI — specifically on how they handle control flow, state management, and error recovery. Here's what we found.
The Problem With Prompt-Driven Agents
Most agent demos follow the same pattern: a system prompt instructs the LLM to follow a multi-step process, then the LLM chains tool calls together in a single generation or a series of unstructured turns. This works for toy tasks. It falls apart in production for three reasons:
1. Silent failures compound. When an LLM skips a validation step or hallucinates a tool output, there's no stack trace. The agent continues with bad state, and by the time you notice the output is wrong, you can't reconstruct which step failed.
2. Prompts aren't composable. Software scales through modules with well-defined interfaces. Prompt chains scale by adding more words to the prompt. Context windows fill up, instructions contradict each other, and adding a new step means re-testing every existing step.
3. No recovery path. When a prompt-driven agent hits an error, your options are: retry the entire run, add more instructions to the prompt (making it worse), or babysit it with a human in the loop.
The fix isn't better prompts. It's treating the LLM as a component inside a deterministic control flow scaffold.
What Deterministic Control Flow Actually Means
In traditional software, control flow is obvious: conditionals, loops, state machines, try/catch blocks. The code decides what happens next based on explicit state, not natural language interpretation.
For AI agents, this means:
- Explicit state transitions — your code decides which step runs next, not the LLM
- Validation checkpoints — structured outputs are validated before the next step executes
- Conditional branching — `if/else` based on parsed tool outputs, not prompt instructions
- Retry with bounds — failed LLM calls retry with a counter, not by re-prompting "try harder"
- Error isolation — one step's failure doesn't corrupt the entire pipeline's state
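Here's what those properties look like in plain Python, before any framework enters the picture. This is a minimal sketch: `call_llm` stands in for whatever client you use, and `ResearchOutput` is a hypothetical schema.

```python
from pydantic import BaseModel, ValidationError

class ResearchOutput(BaseModel):
    summary: str
    confidence: float

MAX_ATTEMPTS = 3  # retry with bounds, not "try harder" prompts

def research_step(topic: str) -> ResearchOutput:
    for attempt in range(MAX_ATTEMPTS):
        raw = call_llm(f"Research {topic}. Reply with JSON.")  # hypothetical client
        try:
            # Validation checkpoint: bad output never reaches the next step.
            return ResearchOutput.model_validate_json(raw)
        except ValidationError:
            continue  # error isolation: this attempt's failure stays here
    raise RuntimeError(f"research_step failed validation after {MAX_ATTEMPTS} attempts")
```

The code, not the model, decides what runs next; a caller branches on `result.confidence` the same way it would on any other value.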
This isn't theoretical. Let's look at how the three most popular frameworks implement it.
LangGraph (31,500+ GitHub Stars)
LangGraph is the most explicit about control flow. You define your agent as a directed graph where nodes are functions (LLM calls, tool executions, validators) and edges are transitions (conditionals, fixed paths).
Control flow model: State graph with conditional edges.
```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

class AgentState(TypedDict):
    topic: str
    findings: list
    valid: bool

# research_step, validate_step, synthesize_step, and should_retry are
# plain functions over AgentState, defined elsewhere.
graph = StateGraph(AgentState)
graph.add_node("research", research_step)
graph.add_node("validate", validate_step)
graph.add_node("synthesize", synthesize_step)

graph.set_entry_point("research")
graph.add_edge("research", "validate")
graph.add_conditional_edges("validate", should_retry, {
    True: "research",     # validation failed: loop back
    False: "synthesize",  # validation passed: move on
})
graph.add_edge("synthesize", END)

app = graph.compile()
```
Where it excels:
- Complex multi-step workflows where the path depends on intermediate results
- Built-in checkpointing — state is persisted between steps, so you can resume after failures (see the sketch after this list)
- Supports cycles (retry loops) without prompt engineering
- TypeScript and Python support
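The checkpointing bullet deserves elaboration. A sketch of resumable runs with the in-memory checkpointer (production deployments would use the SQLite or Postgres savers instead):

```python
from langgraph.checkpoint.memory import MemorySaver

app = graph.compile(checkpointer=MemorySaver())

# Each thread_id names a resumable run: if a step crashes, invoking again
# with the same thread_id resumes from the last persisted checkpoint.
config = {"configurable": {"thread_id": "run-42"}}
result = app.invoke({"topic": "X"}, config)
```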
Where it struggles:
- Learning curve is steep — graph definitions are verbose
- Overkill for simple single-agent tasks
- The abstraction sometimes fights you when you need something that doesn't fit a graph pattern
Best for: Multi-agent workflows, research pipelines, any task where the path through steps is genuinely conditional and complex.
OpenAI Agents SDK (26,000+ GitHub Stars)
OpenAI's Agents SDK takes a more structured approach. Agents are defined as objects with explicit instructions, tools, and handoff points. Control flow is managed through agent handoffs — one agent can delegate to another with structured context.
Control flow model: Agent handoffs with guardrails.
```python
from agents import Agent, Runner

# ResearchOutput and ValidationResult are Pydantic models defined elsewhere.
validator = Agent(
    name="Validator",
    instructions="Verify the research findings.",
    output_type=ValidationResult,
)

researcher = Agent(
    name="Researcher",
    instructions="Find and extract relevant information.",
    tools=[search_tool, extract_tool],
    output_type=ResearchOutput,
    handoffs=[validator],  # the researcher can delegate to the validator
)

result = await Runner.run(researcher, input="Research topic X")
```
Where it excels:
- Simple, clean API — easy to get started
- Structured output validation built in (via Pydantic)
- Guardrails let you define input/output validation as code, not prompts (sketched after this list)
- First-party OpenAI integration (obviously)
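The guardrails bullet is the standout feature, so here's a sketch of an input guardrail using the SDK's decorator pattern. The empty-input check is a made-up example; real guardrails often run a cheap classifier model instead.

```python
from agents import (
    Agent,
    GuardrailFunctionOutput,
    RunContextWrapper,
    input_guardrail,
)

@input_guardrail
async def reject_empty_input(
    ctx: RunContextWrapper, agent: Agent, user_input
) -> GuardrailFunctionOutput:
    # Tripping the wire stops the run before the LLM is called at all.
    return GuardrailFunctionOutput(
        output_info=None,
        tripwire_triggered=not str(user_input).strip(),
    )

guarded_researcher = Agent(
    name="Researcher",
    instructions="Find and extract relevant information.",
    input_guardrails=[reject_empty_input],
)
```

When the tripwire fires, `Runner.run` raises `InputGuardrailTripwireTriggered`, which your control flow can catch like any other exception.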
Where it struggles:
- Locked into OpenAI's model ecosystem (Claude, Gemini require adapters)
- Handoff-based flow is less flexible than LangGraph's graph model
- Fewer built-in patterns for complex retry/cycle logic
Best for: Teams already on OpenAI's stack who want structured, guardrailed agents without the graph abstraction overhead.
Pydantic AI (17,000+ GitHub Stars)
Pydantic AI takes a different philosophical approach: agents are just typed Python functions with LLM calls. The control flow is whatever Python code you write, with Pydantic models providing type safety and validation at every step.
Control flow model: Standard Python control flow with typed LLM calls.
```python
from pydantic_ai import Agent, RunContext

# ResearchResult and SearchResult are Pydantic models; search_api and
# synthesize are defined elsewhere.
research_agent = Agent('openai:gpt-4o', result_type=ResearchResult)

@research_agent.tool
async def search(ctx: RunContext, query: str) -> SearchResult:
    results = await search_api(query)
    return SearchResult(items=results)

async def run_pipeline(topic: str):
    # Standard Python control flow — no graph abstraction
    research = await research_agent.run(f"Research {topic}")
    if research.data.confidence < 0.7:
        # Explicit retry logic, not a prompt
        research = await research_agent.run(
            f"Research {topic} with more depth",
            message_history=research.all_messages(),
        )
    return synthesize(research.data)
```
Where it excels:
- Lowest learning curve — if you know Python, you know Pydantic AI
- Type safety at every boundary (inputs, outputs, tool results)
- Model-agnostic — works with OpenAI, Anthropic, Gemini, Ollama
- Easy to test — agents are just async functions (see the testing sketch below)
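That last bullet is easy to demonstrate. Pydantic AI ships a `TestModel` that fabricates schema-valid responses, so you can unit-test pipeline logic without calling a real provider. A minimal sketch using the agent defined above:

```python
import asyncio

from pydantic_ai.models.test import TestModel

async def test_run_pipeline():
    # TestModel generates valid ResearchResult instances locally, so this
    # exercises only our control flow, with no API key or network calls.
    with research_agent.override(model=TestModel()):
        result = await run_pipeline("quantum error correction")
        assert result is not None

asyncio.run(test_run_pipeline())
```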
Where it struggles:
- No built-in checkpointing or state persistence (a hand-rolled sketch follows this list)
- You have to build your own retry/recovery logic (it's just Python, but you still have to write it)
- Less opinionated means more rope to hang yourself with on complex workflows
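The checkpointing gap is the one to plan for. Here's a hand-rolled sketch of step-level persistence; the file name and structure are hypothetical, and a production version would use a database rather than a JSON file.

```python
import json
from pathlib import Path

STATE_FILE = Path("pipeline_state.json")  # hypothetical; use a real store in prod

async def resumable_pipeline(topic: str):
    # Poor man's checkpointing: completed steps are skipped on re-run.
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    if "research" not in state:
        research = await research_agent.run(f"Research {topic}")
        state["research"] = research.data.model_dump()
        STATE_FILE.write_text(json.dumps(state))
    return synthesize(ResearchResult.model_validate(state["research"]))
```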
Best for: Developers who want maximum flexibility and type safety, teams mixing multiple LLM providers, and agents that fit naturally into existing Python codebases.
Decision Framework: Which One Should You Use?
Don't pick based on star count. Pick based on your actual requirements:
| Requirement | LangGraph | OpenAI Agents SDK | Pydantic AI |
|---|---|---|---|
| Complex multi-step workflows | ✅ Best | ⚠️ Good | ⚠️ Good |
| Model provider flexibility | ✅ Yes | ❌ OpenAI-first | ✅ Yes |
| Built-in state persistence | ✅ Checkpointing | ⚠️ Limited | ❌ Roll your own |
| Learning curve | 🔴 Steep | 🟢 Easy | 🟢 Easy |
| Type safety / validation | ⚠️ Good | ✅ Good | ✅ Best |
| Multi-agent orchestration | ✅ Native | ✅ Handoffs | ⚠️ Manual |
| Production observability | ✅ LangSmith | ⚠️ Basic | ⚠️ Basic |
Our recommendation:
1. If your agent has 5+ conditional steps and needs persistence → LangGraph. The graph model is worth the learning curve.
2. If you're on OpenAI models and want something production-ready fast → OpenAI Agents SDK. The guardrails API alone saves weeks.
3. If you want model flexibility and type safety in existing Python code → Pydantic AI. It's the least framework-y framework.
The Anti-Patterns to Avoid
Regardless of which framework you choose, these patterns will kill your agent's reliability:
Anti-pattern 1: Giant system prompts with numbered steps. If your prompt reads like a manual, you've moved your logic to the wrong layer. Steps 1-8 belong in code, not in a string that the LLM can ignore.
Anti-pattern 2: "Vibe-based" error handling. If your error handling is "if the output looks wrong, try again," you don't have error handling. Use structured outputs with validation schemas. Fail fast. Retry with bounds.
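Concretely, "structured outputs with validation schemas" means encoding your quality bar as field constraints, so a bad output raises instead of flowing downstream. A sketch with a hypothetical schema (`raw_llm_response` is the model's raw string output from the previous call):

```python
import logging

from pydantic import BaseModel, Field, ValidationError

log = logging.getLogger(__name__)

class StepOutput(BaseModel):
    answer: str
    sources: list[str] = Field(min_length=1)   # no unsourced answers
    confidence: float = Field(ge=0.0, le=1.0)  # out-of-range values are rejected

try:
    output = StepOutput.model_validate_json(raw_llm_response)
except ValidationError as exc:
    # Fail fast with a real error; the bounded retry loop sees exactly what broke.
    log.warning("step output rejected: %s", exc)
    raise
```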
Anti-pattern 3: God agents. One agent that does everything — research, validation, synthesis, formatting — will produce worse results than a pipeline of specialized agents, each with a narrow scope. The LLM performs better when it has one job.
Anti-pattern 4: No observability. If you can't trace which step failed, why it failed, and what the LLM's raw output was before validation, you're flying blind. LangSmith and similar tools exist for a reason.
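Getting to that level of traceability can be as light as a decorator. A sketch with LangSmith's `traceable` (this assumes the `langsmith` package is installed and `LANGSMITH_API_KEY` plus `LANGSMITH_TRACING=true` are set in the environment; other tracing tools follow a similar pattern):

```python
from langsmith import traceable

@traceable(name="validate_step")
def validate_step(state: dict) -> dict:
    # Inputs, outputs, and raised exceptions are recorded as a trace span,
    # so a failed run shows which step broke and with what payload.
    if not state.get("findings"):
        raise ValueError("validation failed: no findings in state")
    return state
```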
What This Means for Your Stack
The shift from prompt-driven to control-flow-driven agents is the same shift the industry made from "AI completes your code" to "AI agents run in a sandbox with permissions." It's about making AI behavior predictable, debuggable, and composable.
If you're building agents today and your architecture is "big prompt + tool calls," stop. Pick a framework that gives you explicit control flow. The upfront complexity is real, but the alternative — debugging why your agent silently skipped step 4 and hallucinated the output of step 7 — is worse.
The tools and frameworks we've covered here are all tracked and compared on NeuralStackly's agent frameworks page. If you're evaluating which agent stack fits your team's workflow, the benchmarks and comparison tools can help you make the call with real data instead of marketing claims.
For teams deploying agents in production, also check out our guides on agent observability and agent evaluation — because control flow without monitoring is just a faster way to fail silently.