Why Your AI Agent Needs Control Flow, Not More Prompts
Prompt-driven agents break at scale. Here's how LangGraph, OpenAI Agents SDK, and Pydantic AI handle deterministic control flow — and when each approach actually works.
Last Updated: May 2026
If you've ever written `MANDATORY: DO NOT SKIP THIS STEP` in a system prompt, you've already lost. Your agent architecture is broken, and no amount of prompt engineering will fix it.
A post trending at 500+ points on Hacker News this week made the argument clearly: reliable agents tackling complex tasks need deterministic control flow encoded in software, not increasingly elaborate prompt chains. The thesis is right, but it stops short of the practical question — which frameworks actually implement this correctly, and what are the tradeoffs?
We tested three of the most popular agent frameworks — LangGraph, OpenAI Agents SDK, and Pydantic AI — specifically on how they handle control flow, state management, and error recovery. Here's what we found.
The Problem With Prompt-Driven Agents
Most agent demos follow the same pattern: a system prompt instructs the LLM to follow a multi-step process, then the LLM chains tool calls together in a single generation or a series of unstructured turns. This works for toy tasks. It falls apart in production for three reasons:
1. Silent failures compound. When an LLM skips a validation step or hallucinates a tool output, there's no stack trace. The agent continues with bad state, and by the time you notice the output is wrong, you can't reconstruct which step failed.
2. Prompts aren't composable. Software scales through modules with well-defined interfaces. Prompt chains scale by adding more words to the prompt. Context windows fill up, instructions contradict each other, and adding a new step means re-testing every existing step.
3. No recovery path. When a prompt-driven agent hits an error, your options are: retry the entire run, add more instructions to the prompt (making it worse), or babysit it with a human in the loop.
The fix isn't better prompts. It's treating the LLM as a component inside a deterministic control flow scaffold.
What Deterministic Control Flow Actually Means
In traditional software, control flow is obvious: conditionals, loops, state machines, try/catch blocks. The code decides what happens next based on explicit state, not natural language interpretation.
For AI agents, this means:
- Explicit state transitions — your code decides which step runs next, not the LLM
- Validation checkpoints — structured outputs are validated before the next step executes
- Conditional branching — `if/else` based on parsed tool outputs, not prompt instructions
- Retry with bounds — failed LLM calls retry with a counter, not by re-prompting "try harder"
- Error isolation — one step's failure doesn't corrupt the entire pipeline's state
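Here's what those properties look like in plain Python, before any framework enters the picture. This is a minimal sketch: `call_llm` stands in for whatever client you use, and `ResearchOutput` is a hypothetical schema.

```python
from pydantic import BaseModel, ValidationError

class ResearchOutput(BaseModel):
    summary: str
    confidence: float

MAX_ATTEMPTS = 3  # retry with bounds, not "try harder" prompts

def research_step(topic: str) -> ResearchOutput:
    for attempt in range(MAX_ATTEMPTS):
        raw = call_llm(f"Research {topic}. Reply with JSON.")  # hypothetical client
        try:
            # Validation checkpoint: bad output never reaches the next step.
            return ResearchOutput.model_validate_json(raw)
        except ValidationError:
            continue  # error isolation: this attempt's failure stays here
    raise RuntimeError(f"research_step failed validation after {MAX_ATTEMPTS} attempts")
```

The code, not the model, decides what runs next; a caller branches on `result.confidence` the same way it would on any other value.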
This isn't theoretical. Let's look at how the three most popular frameworks implement it.
LangGraph (31,500+ GitHub Stars)
LangGraph is the most explicit about control flow. You define your agent as a directed graph where nodes are functions (LLM calls, tool executions, validators) and edges are transitions (conditionals, fixed paths).
Control flow model: State graph with conditional edges.
```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

class AgentState(TypedDict):
    topic: str
    findings: list
    valid: bool

# research_step, validate_step, synthesize_step, and should_retry are
# plain functions over AgentState, defined elsewhere.
graph = StateGraph(AgentState)
graph.add_node("research", research_step)
graph.add_node("validate", validate_step)
graph.add_node("synthesize", synthesize_step)

graph.set_entry_point("research")
graph.add_edge("research", "validate")
graph.add_conditional_edges("validate", should_retry, {
    True: "research",     # validation failed: loop back
    False: "synthesize",  # validation passed: move on
})
graph.add_edge("synthesize", END)

app = graph.compile()
```
Where it excels:
- Complex multi-step workflows where the path depends on intermediate results
- Built-in checkpointing — state is persisted between steps, so you can resume after failures (see the sketch after this list)
- Supports cycles (retry loops) without prompt engineering
- TypeScript and Python support
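The checkpointing bullet deserves elaboration. A sketch of resumable runs with the in-memory checkpointer (production deployments would use the SQLite or Postgres savers instead):

```python
from langgraph.checkpoint.memory import MemorySaver

app = graph.compile(checkpointer=MemorySaver())

# Each thread_id names a resumable run: if a step crashes, invoking again
# with the same thread_id resumes from the last persisted checkpoint.
config = {"configurable": {"thread_id": "run-42"}}
result = app.invoke({"topic": "X"}, config)
```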
Where it struggles:
- Learning curve is steep — graph definitions are verbose
- Overkill for simple single-agent tasks
- The abstraction sometimes fights you when you need something that doesn't fit a graph pattern
Best for: Multi-agent workflows, research pipelines, any task where the path through steps is genuinely conditional and complex.
OpenAI Agents SDK (26,000+ GitHub Stars)
OpenAI's Agents SDK takes a more structured approach. Agents are defined as objects with explicit instructions, tools, and handoff points. Control flow is managed through agent handoffs — one agent can delegate to another with structured context.
Control flow model: Agent handoffs with guardrails.
```python
from agents import Agent, Runner

# ResearchOutput and ValidationResult are Pydantic models defined elsewhere.
validator = Agent(
    name="Validator",
    instructions="Verify the research findings.",
    output_type=ValidationResult,
)

researcher = Agent(
    name="Researcher",
    instructions="Find and extract relevant information.",
    tools=[search_tool, extract_tool],
    output_type=ResearchOutput,
    handoffs=[validator],  # the researcher can delegate to the validator
)

result = await Runner.run(researcher, input="Research topic X")
```
Where it excels:
- Simple, clean API — easy to get started
- Structured output validation built in (via Pydantic)
- Guardrails let you define input/output validation as code, not prompts (sketched after this list)
- First-party OpenAI integration (obviously)
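The guardrails bullet is the standout feature, so here's a sketch of an input guardrail using the SDK's decorator pattern. The empty-input check is a made-up example; real guardrails often run a cheap classifier model instead.

```python
from agents import (
    Agent,
    GuardrailFunctionOutput,
    RunContextWrapper,
    input_guardrail,
)

@input_guardrail
async def reject_empty_input(
    ctx: RunContextWrapper, agent: Agent, user_input
) -> GuardrailFunctionOutput:
    # Tripping the wire stops the run before the LLM is called at all.
    return GuardrailFunctionOutput(
        output_info=None,
        tripwire_triggered=not str(user_input).strip(),
    )

guarded_researcher = Agent(
    name="Researcher",
    instructions="Find and extract relevant information.",
    input_guardrails=[reject_empty_input],
)
```

When the tripwire fires, `Runner.run` raises `InputGuardrailTripwireTriggered`, which your control flow can catch like any other exception.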
Where it struggles:
- Locked into OpenAI's model ecosystem (Claude, Gemini require adapters)
- Handoff-based flow is less flexible than LangGraph's graph model
- Fewer built-in patterns for complex retry/cycle logic
Best for: Teams already on OpenAI's stack who want structured, guardrailed agents without the graph abstraction overhead.
Pydantic AI (17,000+ GitHub Stars)
Pydantic AI takes a different philosophical approach: agents are just typed Python functions with LLM calls. The control flow is whatever Python code you write, with Pydantic models providing type safety and validation at every step.
Control flow model: Standard Python control flow with typed LLM calls.
```python
from pydantic_ai import Agent, RunContext

# ResearchResult and SearchResult are Pydantic models; search_api and
# synthesize are defined elsewhere.
research_agent = Agent('openai:gpt-4o', result_type=ResearchResult)

@research_agent.tool
async def search(ctx: RunContext, query: str) -> SearchResult:
    results = await search_api(query)
    return SearchResult(items=results)

async def run_pipeline(topic: str):
    # Standard Python control flow — no graph abstraction
    research = await research_agent.run(f"Research {topic}")
    if research.data.confidence < 0.7:
        # Explicit retry logic, not a prompt
        research = await research_agent.run(
            f"Research {topic} with more depth",
            message_history=research.all_messages(),
        )
    return synthesize(research.data)
```
Where it excels:
- Lowest learning curve — if you know Python, you know Pydantic AI
- Type safety at every boundary (inputs, outputs, tool results)
- Model-agnostic — works with OpenAI, Anthropic, Gemini, Ollama
- Easy to test — agents are just async functions (see the testing sketch below)
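That last bullet is easy to demonstrate. Pydantic AI ships a `TestModel` that fabricates schema-valid responses, so you can unit-test pipeline logic without calling a real provider. A minimal sketch using the agent defined above:

```python
import asyncio

from pydantic_ai.models.test import TestModel

async def test_run_pipeline():
    # TestModel generates valid ResearchResult instances locally, so this
    # exercises only our control flow, with no API key or network calls.
    with research_agent.override(model=TestModel()):
        result = await run_pipeline("quantum error correction")
        assert result is not None

asyncio.run(test_run_pipeline())
```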
Where it struggles:
- No built-in checkpointing or state persistence (a hand-rolled sketch follows this list)
- You have to build your own retry/recovery logic (it's just Python, but you still have to write it)
- Less opinionated means more rope to hang yourself with on complex workflows
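The checkpointing gap is the one to plan for. Here's a hand-rolled sketch of step-level persistence; the file name and structure are hypothetical, and a production version would use a database rather than a JSON file.

```python
import json
from pathlib import Path

STATE_FILE = Path("pipeline_state.json")  # hypothetical; use a real store in prod

async def resumable_pipeline(topic: str):
    # Poor man's checkpointing: completed steps are skipped on re-run.
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    if "research" not in state:
        research = await research_agent.run(f"Research {topic}")
        state["research"] = research.data.model_dump()
        STATE_FILE.write_text(json.dumps(state))
    return synthesize(ResearchResult.model_validate(state["research"]))
```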
Best for: Developers who want maximum flexibility and type safety, teams mixing multiple LLM providers, and agents that fit naturally into existing Python codebases.
Decision Framework: Which One Should You Use?
Don't pick based on star count. Pick based on your actual requirements:
| Requirement | LangGraph | OpenAI Agents SDK | Pydantic AI |
|---|---|---|---|
| Complex multi-step workflows | ✅ Best | ⚠️ Good | ⚠️ Good |
| Model provider flexibility | ✅ Yes | ❌ OpenAI-first | ✅ Yes |
| Built-in state persistence | ✅ Checkpointing | ⚠️ Limited | ❌ Roll your own |
| Learning curve | 🔴 Steep | 🟢 Easy | 🟢 Easy |
| Type safety / validation | ⚠️ Good | ✅ Good | ✅ Best |
| Multi-agent orchestration | ✅ Native | ✅ Handoffs | ⚠️ Manual |
| Production observability | ✅ LangSmith | ⚠️ Basic | ⚠️ Basic |
Our recommendation:
1. If your agent has 5+ conditional steps and needs persistence → LangGraph. The graph model is worth the learning curve.
2. If you're on OpenAI models and want something production-ready fast → OpenAI Agents SDK. The guardrails API alone saves weeks.
3. If you want model flexibility and type safety in existing Python code → Pydantic AI. It's the least framework-y framework.
The Anti-Patterns to Avoid
Regardless of which framework you choose, these patterns will kill your agent's reliability:
Anti-pattern 1: Giant system prompts with numbered steps. If your prompt reads like a manual, you've moved your logic to the wrong layer. Steps 1-8 belong in code, not in a string that the LLM can ignore.
Anti-pattern 2: "Vibe-based" error handling. If your error handling is "if the output looks wrong, try again," you don't have error handling. Use structured outputs with validation schemas. Fail fast. Retry with bounds.
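Concretely, "structured outputs with validation schemas" means encoding your quality bar as field constraints, so a bad output raises instead of flowing downstream. A sketch with a hypothetical schema (`raw_llm_response` is the model's raw string output from the previous call):

```python
import logging

from pydantic import BaseModel, Field, ValidationError

log = logging.getLogger(__name__)

class StepOutput(BaseModel):
    answer: str
    sources: list[str] = Field(min_length=1)   # no unsourced answers
    confidence: float = Field(ge=0.0, le=1.0)  # out-of-range values are rejected

try:
    output = StepOutput.model_validate_json(raw_llm_response)
except ValidationError as exc:
    # Fail fast with a real error; the bounded retry loop sees exactly what broke.
    log.warning("step output rejected: %s", exc)
    raise
```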
Anti-pattern 3: God agents. One agent that does everything — research, validation, synthesis, formatting — will produce worse results than a pipeline of specialized agents, each with a narrow scope. The LLM performs better when it has one job.
Anti-pattern 4: No observability. If you can't trace which step failed, why it failed, and what the LLM's raw output was before validation, you're flying blind. LangSmith and similar tools exist for a reason.
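Getting to that level of traceability can be as light as a decorator. A sketch with LangSmith's `traceable` (this assumes the `langsmith` package is installed and `LANGSMITH_API_KEY` plus `LANGSMITH_TRACING=true` are set in the environment; other tracing tools follow a similar pattern):

```python
from langsmith import traceable

@traceable(name="validate_step")
def validate_step(state: dict) -> dict:
    # Inputs, outputs, and raised exceptions are recorded as a trace span,
    # so a failed run shows which step broke and with what payload.
    if not state.get("findings"):
        raise ValueError("validation failed: no findings in state")
    return state
```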
What This Means for Your Stack
The shift from prompt-driven to control-flow-driven agents is the same shift the industry made from "AI completes your code" to "AI agents run in a sandbox with permissions." It's about making AI behavior predictable, debuggable, and composable.
If you're building agents today and your architecture is "big prompt + tool calls," stop. Pick a framework that gives you explicit control flow. The upfront complexity is real, but the alternative — debugging why your agent silently skipped step 4 and hallucinated the output of step 7 — is worse.
The tools and frameworks we've covered here are all tracked and compared on NeuralStackly's agent frameworks page. If you're evaluating which agent stack fits your team's workflow, the benchmarks and comparison tools can help you make the call with real data instead of marketing claims.
For teams deploying agents in production, also check out our guides on agent observability and agent evaluation — because control flow without monitoring is just a faster way to fail silently.