AI Agents · May 10, 2026 · 8 min read

MCP Tool Calling vs Code Execution: The New Battleground for AI Agents

AI agents are shifting from JSON-based MCP tool schemas to writing and running real code. Here's what the code-execution movement means for your stack, and when to use each approach.

By NeuralStackly


Every AI agent framework in 2026 solves the same problem: how does the model do things? For the last year, the answer was MCP, JSON schemas that describe tools the model can call. But a new pattern is emerging fast: skip the schema, give the agent a code interpreter, and let it write actual programs to accomplish tasks.

This isn't theoretical. Three projects represent different bets on where agent tooling goes next:

  • Zapcode: a Rust-based sandboxed TypeScript interpreter designed as an alternative to MCP tool calling, with 2-microsecond cold starts
  • Pydantic DeepAgents: a Python framework building Claude Code-style agents with tool calling, sandboxed Docker execution, and multi-agent teams
  • Devcontainer MCP: an MCP server that lets agents like Claude Code and Copilot work directly inside dev containers

Together they represent a spectrum: from "MCP but better" to "MCP is the wrong abstraction entirely."

The Problem With JSON Tool Schemas

MCP (Model Context Protocol) works like this: you define a tool as a JSON schema with a name, description, and typed parameters. The model sees the schema, decides to call the tool, and your server executes it. Clean, predictable, safe.
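To make that concrete, here is a minimal sketch of the shape a tool definition takes when a server advertises it, written as a TypeScript object. The `get_user` tool, its parameters, and the simplified interface are illustrative, not taken from any real server.

```typescript
// Sketch of an MCP-style tool definition: a name, a description, and a JSON
// Schema describing the parameters. The concrete tool shown here is made up.
interface ToolDefinition {
  name: string;
  description: string;
  inputSchema: {
    type: "object";
    properties: Record<string, { type: string; description?: string }>;
    required?: string[];
  };
}

const getUserTool: ToolDefinition = {
  name: "get_user",
  description: "Look up a user record by ID",
  inputSchema: {
    type: "object",
    properties: {
      userId: { type: "string", description: "Unique identifier of the user" },
    },
    required: ["userId"],
  },
};
```

The model never sees your implementation; it only sees this description and decides when to call the tool and with which arguments.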

Until it isn't.

The friction shows up in three places:

1. The schema wall. Every tool the agent might use needs a schema. For simple operations, like looking up a user or sending a message, this works. But for complex workflows, you end up with dozens of tools, each with overlapping parameters. The model has to pick the right tool and fill the right parameters, and more tools mean more confusion and worse tool selection.

2. Composability is zero. MCP tools are atomic. An agent can't combine "search the database" with "format the results as CSV" with "upload to S3" in a single tool call. It has to make three sequential calls (sketched after this list), each one a potential failure point, each one burning context-window tokens. For an agent doing real work, such as analyzing a dataset, debugging a production issue, or migrating a schema, this sequential model is painfully slow.

3. The schema drift problem. When an API changes, your MCP tool schema breaks silently. The model calls the tool with stale parameters and gets errors. There's no type system, no compiler, no linting, just a JSON object that used to work and now doesn't. A recent arXiv study (DELEGATE-52, April 2026) found that current LLMs introduced errors across 52 professional document-editing domains when delegating tasks: the model thought it was using the right tool correctly, but the output was corrupted.
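As a rough illustration of point 2, here is what that three-step workflow looks like as a tool-call sequence. The tool names and arguments are hypothetical, and each call's output has to round-trip through the model's context before the next call can be issued.

```typescript
// Hypothetical tool-call sequence for "search the database, format as CSV,
// upload to S3" under plain MCP tool calling. Each step is a separate call,
// and each result is re-serialized into the model's context window.
const toolCallSequence = [
  { tool: "query_database", args: { sql: "SELECT id, email FROM users WHERE active = true" } },
  { tool: "format_as_csv", args: { data: "<result of call 1, echoed back through the model>" } },
  { tool: "upload_to_s3", args: { bucket: "reports", key: "active-users.csv", body: "<result of call 2>" } },
];
```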

The Code-Execution Alternative

The code-execution approach inverts the model. Instead of calling pre-defined tools, the agent writes a program. That program runs in a sandboxed environment. The output comes back to the model.

This is what Claude Code does when it edits your files. It's what Devin does when it builds features. It's what your terminal agent does when it runs shell commands. The agent isn't calling a tool; it's writing code and executing it.

Zapcode takes this to its logical extreme. It's a TypeScript interpreter written in Rust, designed specifically for AI agents. The pitch: 2-microsecond cold starts, fully sandboxed execution, no Docker overhead. Instead of defining a tool schema, you let the agent write TypeScript and run it. The interpreter handles the security boundary.
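For contrast, here is a hedged sketch of the program an agent might write to do the same search/format/upload workflow in one shot. `queryDatabase` and `uploadToS3` are hypothetical helpers standing in for whatever the sandbox actually exposes; they are not part of Zapcode or any specific runtime.

```typescript
// Hypothetical sandbox-provided helpers; names and signatures are illustrative.
declare function queryDatabase(sql: string): Promise<Record<string, unknown>[]>;
declare function uploadToS3(bucket: string, key: string, body: string): Promise<void>;

// One agent-written program replaces three sequential tool calls. Intermediate
// data stays inside the sandbox instead of round-tripping through the model.
async function exportActiveUsers(): Promise<string> {
  const rows = await queryDatabase("SELECT id, email, last_login FROM users WHERE active = true");

  // Format as CSV in-process rather than via a separate tool call.
  const header = Object.keys(rows[0] ?? {}).join(",");
  const lines = rows.map((row) => Object.values(row).join(","));
  const csv = [header, ...lines].join("\n");

  await uploadToS3("reports", "active-users.csv", csv);
  return `Uploaded ${rows.length} rows`;
}
```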

This solves all three MCP problems:

  • No schema wall. The agent can write any code it needs. No pre-defined tools to manage.
  • Native composability. It's code. Import libraries, chain operations, handle errors. The agent writes a program, not a sequence of tool calls.
  • Type safety. TypeScript catches entire classes of errors before execution. The model gets compiler feedback, not runtime errors.

But Here's the Catch

Code execution has its own failure modes, and they're serious:

Security is harder. A JSON tool schema can't delete your database. Arbitrary code can. Sandboxing helps, but sandbox escapes are a regular occurrence. You need real isolation: Docker containers, WASM runtimes, or something like Zapcode's Rust-based interpreter. Each adds latency and complexity.

Determinism is weaker. JSON tool calls are predictable. Code execution depends on the runtime environment, installed packages, file system state, and network access. Two executions of the "same" agent code can produce different results if the environment differs.

Cost is higher. Code execution burns more tokens: the model has to write actual programs, not just fill in parameters. And the execution itself uses compute. For simple operations (look up a user, send an email), MCP is 10-50x cheaper.

Debugging is a nightmare. When an MCP tool call fails, you can see the schema, the parameters, and the error. When an agent's code fails, you're debugging a program you didn't write, in a sandbox you can't easily inspect.

The Hybrid Pattern (What Actually Works in Production)

After talking to teams running agents in production and tracking the agent framework landscape, the pattern that survives contact with reality is hybrid:

Use MCP for well-defined operations. Database queries, API calls, file reads, sending messages. These are operations where the input/output is predictable, the schema is stable, and the risk is low. MCP's JSON schemas are actually good for this: they constrain the model's behavior in useful ways.

Use code execution for complex workflows. Data analysis, multi-step transformations, debugging sessions, anything that requires iteration or branching. The agent writes code, runs it, reads the output, and decides what to do next. This is where code execution shines: the agent can adapt, retry, and compose operations in ways that MCP tool chains can't.

Use containerized sandboxes for anything that touches the real world. Pydantic DeepAgents (780+ stars as of May 2026) uses Docker sandboxes for agent execution. Devcontainer MCP gives agents a full dev container environment. Both approaches treat the execution environment as disposable: if the agent breaks something, you throw it away and start fresh.
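One way this split can look in an orchestration layer is sketched below. `callMcpTool` and `runInSandbox` are hypothetical placeholders for your MCP client and your sandbox of choice, not real SDK calls.

```typescript
// Hypothetical routing layer: well-defined operations go through MCP tools,
// complex workflows go through agent-written code in a disposable sandbox.
type AgentAction =
  | { kind: "tool"; name: string; args: Record<string, unknown> }
  | { kind: "code"; source: string };

// Placeholders for a real MCP client and a real sandbox runtime.
declare function callMcpTool(name: string, args: Record<string, unknown>): Promise<string>;
declare function runInSandbox(source: string): Promise<string>;

async function execute(action: AgentAction): Promise<string> {
  if (action.kind === "tool") {
    // Schema-constrained, predictable, cheap: the MCP path.
    return callMcpTool(action.name, action.args);
  }
  // Multi-step or branching work: run the agent's program in a sandbox
  // that can be thrown away if anything breaks.
  return runInSandbox(action.source);
}
```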

When to Use What: A Decision Framework

| Factor | MCP Tool Calling | Code Execution |
|---|---|---|
| Operation complexity | Single, well-defined actions | Multi-step, branching workflows |
| Composability needed | Low (one tool, one job) | High (chain, branch, iterate) |
| Security sensitivity | Low (read-heavy, idempotent) | Medium (needs sandboxing) |
| Cost sensitivity | High (optimize per-call cost) | Medium (accept higher per-task cost) |
| Determinism required | High (exact same input → exact same output) | Medium (some variance acceptable) |
| Schema stability | High (API rarely changes) | Low (agent adapts to changes) |

Good MCP candidates: Slack message sending, database reads, GitHub issue creation, file reads, API lookups.

Good code-execution candidates: Data pipeline transformations, debugging sessions, multi-repository code search, test generation with execution validation, infrastructure provisioning scripts.

What This Means for Your Stack

If you're building AI agents today, you need both. Not because it's trendy, but because the failure modes are different and complementary:

1. Start with MCP for your core tool set. Define schemas for the 10-20 operations your agent needs most. Keep schemas narrow and well-typed. Use the MCP Inspector (9,700+ stars) for visual testing during development.

2. Add code execution for complex tasks. Use a sandboxed interpreter: Zapcode for TypeScript, Docker containers via Pydantic DeepAgents for Python, or a dev container via Devcontainer MCP for full environment isolation.

3. Instrument everything. Whether the agent calls a tool or runs code, you need observability. Track tool selection accuracy, code execution success rates, token usage per operation, and error patterns. Use agent observability tools to catch regressions before they hit production; a minimal wrapper is sketched after this list.

4. Version your tools. The DELEGATE-52 paper showed that models corrupt documents when tool schemas drift from reality. Treat your MCP tool definitions like code: version them, test them, and roll back when they break. For AI code review workflows, this is table stakes.
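Here is a minimal sketch of the instrumentation wrapper mentioned in point 3, assuming you control the layer that dispatches both tool calls and code runs. The record shape and the logging sink are assumptions, not any particular observability product.

```typescript
// Wrap every agent action, whether an MCP tool call or a sandboxed code run,
// and record the outcome. The record shape and console sink are illustrative.
interface ActionRecord {
  kind: "tool" | "code";
  name: string;          // tool name, or a label for the generated script
  ok: boolean;
  durationMs: number;
  error?: string;
}

// Stand-in sink; in practice this would feed your metrics or tracing backend.
function emit(record: ActionRecord): void {
  console.log(JSON.stringify(record));
}

async function instrumented<T>(
  kind: "tool" | "code",
  name: string,
  run: () => Promise<T>,
): Promise<T> {
  const start = Date.now();
  try {
    const result = await run();
    emit({ kind, name, ok: true, durationMs: Date.now() - start });
    return result;
  } catch (err) {
    emit({ kind, name, ok: false, durationMs: Date.now() - start, error: String(err) });
    throw err;
  }
}
```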

The Trend: Code-First Agents Are Winning

The most capable agents shipping today (Claude Code, OpenAI Codex, Devin) all use code execution as their primary action mechanism. They don't call tools. They write programs. MCP is still there for integration (connecting to external services), but the thinking happens in code.

This is the direction. The question isn't "MCP or code execution?" It's "where do I use each?" The answer is based on complexity, security, and cost, not ideology.

If you're evaluating agent frameworks for your team, check our AI agent framework comparison and coding agent benchmarks. For the full LLM provider landscape, including which providers support structured output, tool calling, and code execution, see our LLM API provider guide.

The agents that ship in the second half of 2026 will be code-first with MCP for integration. Build your stack accordingly.
