Agent Model Routing: When Small Models Beat Frontier Models for Tool Calling
Most AI agent steps don't need GPT-5 or Claude Opus. Here's how to route structured tasks to cheaper, faster small models, with real cost and latency numbers.
If you're building an AI agent that makes 100 LLM calls per task, and every call goes to Claude Opus at $15/MTok, you're spending $30-50 per complex task. But here's the thing: most of those calls don't need a frontier model.
Tool calling, JSON extraction, intent classification, input validation: these are tasks where a 1B-parameter model often matches a 200B-parameter model at 1/100th the cost. The trick is knowing when to route down and when to stay on the frontier.
This isn't theory. A team at Cactus Compute just open-sourced Needle, a 26M-parameter model distilled from Gemini 3.1 specifically for function calling. It runs at 6,000 tokens/second prefill on consumer hardware and beats models 10x its size on single-shot tool calling. That project hit 688 points on Hacker News this week because it crystallized something developers have been feeling: we're overpaying for most agent steps.
Here's a practical framework for model routing in production agents.
The 80/20 of Agent LLM Calls
Most agent workflows follow a predictable pattern:
1. Parse user input: understand intent, extract parameters
2. Plan: decide which tools to call, in what order
3. Execute: call tools, parse results
4. Validate: check outputs, handle errors
5. Respond: synthesize a final answer
Steps 1, 3, and 4 are structured tasks. They need reliable JSON output, not creative reasoning. Steps 2 and 5 are where frontier models earn their price tag.
In a typical 10-step agent loop, maybe 2-3 steps genuinely need GPT-5 or Claude Opus. The other 7-8 are routing, formatting, and validation that run fine on GPT-4.1-mini, Claude Haiku, or even a local model.
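To make that split concrete, here is a minimal sketch of a per-step tier map. The step names, tier labels, and model IDs are illustrative placeholders, not a prescribed API.

```typescript
// Sketch of the 80/20 split: each agent step maps to a model tier.
// Step names and model IDs below are illustrative, not a fixed API.
type Tier = "frontier" | "small";

const stepTier: Record<string, Tier> = {
  parseInput: "small",    // structured extraction
  plan: "frontier",       // multi-step reasoning
  executeTools: "small",  // fill tool parameters, parse results
  validate: "small",      // schema and error checks
  respond: "frontier",    // final synthesis for the user
};

// Swap in whatever concrete model IDs you actually use.
const modelFor = (step: keyof typeof stepTier): string =>
  stepTier[step] === "frontier" ? "your-frontier-model" : "your-small-model";
```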
Cost Comparison: Routing vs. Single-Model
Let's look at real numbers. Say you're building a coding agent that:
- Takes a bug report as input
- Plans a fix strategy
- Reads 5 files (tool calls)
- Writes a patch (tool call)
- Runs tests (tool call)
- Validates output (structured check)
- Writes a summary
All-Opus approach (Claude Opus 4.7):
- ~50K tokens per task (input + output across all steps)
- $15/MTok input, $75/MTok output
- Estimated cost: $0.75-$1.20 per task
- Latency: 30-60 seconds
Routed approach (Opus for planning + Haiku for execution):
- Opus: ~8K tokens (planning + final synthesis) → $0.12
- Haiku: ~42K tokens (tool calls, parsing, validation) → $0.02
- Total: $0.14 per task
- Latency: 15-25 seconds
That's a 5-8x cost reduction and 2x speed improvement. Over 10,000 tasks per month, you save $6,000-10,000. For a startup or small team, that's the difference between "AI feature we can ship" and "AI feature we can't afford."
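The arithmetic is simple enough to sanity-check yourself. Here is a rough sketch using the article's Opus prices and an assumed small-model price; the exact result depends on your input/output token split, so expect the same ballpark rather than the exact figures above.

```typescript
// Per-task cost in dollars: prices are $/MTok (million tokens).
// Opus prices are from the article; the small-model prices are assumptions.
const taskCost = (inTok: number, outTok: number, inPrice: number, outPrice: number) =>
  (inTok * inPrice + outTok * outPrice) / 1_000_000;

// All-Opus: ~50K tokens per task, mostly input (file contents, tool results).
const allOpus = taskCost(45_000, 5_000, 15, 75); // ≈ $1.05

// Routed: ~8K tokens on Opus (plan + synthesis), ~42K on a small model.
const routed =
  taskCost(7_000, 1_000, 15, 75) + // ≈ $0.18
  taskCost(40_000, 2_000, 1, 5);   // ≈ $0.05

console.log({ allOpus, routed, savingsPer10kTasks: (allOpus - routed) * 10_000 });
```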
What Small Models Are Actually Good At
Not all small models are equal. Here's what works well at the 1B-8B scale:
Reliable tasks for small models:
- Function calling with known schemas. If your tools have well-defined JSON schemas, a small model can fill in parameters reliably. This is exactly what Needle was designed for.
- Intent classification. "Is this a bug report, feature request, or question?" A 3B model gets this right 95%+ of the time.
- Output formatting. Converting free-text into structured JSON, extracting fields, normalizing data.
- Simple validation. "Did the API response contain an error? Is the output valid JSON? Are all required fields present?"
- Cache-friendly retrieval. If you're doing RAG and the retrieval + formatting step doesn't need reasoning, route it down.
Tasks that still need frontier models:
- Multi-step planning with dependencies. "Read the database schema, then decide which tables to join, then write a query that handles edge cases." This needs real reasoning.
- Code generation with context. Writing production code that fits into an existing codebase requires understanding conventions, patterns, and constraints that small models miss.
- Error recovery. When something goes wrong and the agent needs to debug itself, that's frontier territory.
- Ambiguous user intent. When the user's request is unclear and the agent needs to ask clarifying questions or make judgment calls.
Architecture Patterns for Model Routing
Here are three patterns I've seen work in production:
1. The Cascade
User input → Small model (classify intent)
  ├─ Simple task → Small model handles entirely
  └─ Complex task → Route to frontier model
This is the simplest pattern. Use a small model as a traffic cop. If the task is "list my recent commits," a small model can format and execute a git log command. If the task is "refactor the auth module to use OAuth2," route up.
Best for: Agents with a wide range of task complexity, where 60%+ of tasks are simple.
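In code, the cascade is one classification call and a branch. A minimal sketch, where classifyIntent and callModel are hypothetical stand-ins for your own small-model and frontier-model wrappers:

```typescript
// Hypothetical stubs: replace with real calls to your small and frontier models.
declare function classifyIntent(input: string): Promise<"simple" | "complex">;
declare function callModel(model: "small" | "frontier", prompt: string): Promise<string>;

async function handle(userInput: string): Promise<string> {
  // Small model acts as the traffic cop: cheap, fast, structured output.
  const intent = await classifyIntent(userInput);

  if (intent === "simple") {
    // e.g. "list my recent commits": the small model handles it entirely.
    return callModel("small", userInput);
  }
  // e.g. "refactor the auth module to use OAuth2": route up.
  return callModel("frontier", userInput);
}
```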
2. The Sandwich
Small model (parse input)
→ Frontier model (plan + reason)
→ Small model (execute tool calls)
→ Small model (validate output)
→ Frontier model (synthesize response)
The frontier model does the thinking. Small models handle the bookkeeping. This is the most common pattern for coding agents and research agents.
Best for: Multi-step agents where the planning step is expensive but execution is mechanical.
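As a sketch, the sandwich is a pipeline. Here small() and frontier() are hypothetical wrappers around whichever SDK you use, and executeTools() is your own tool runner:

```typescript
// Hypothetical wrappers: small/frontier call a model, executeTools runs real tools.
declare function small(task: string, input: string): Promise<string>;
declare function frontier(task: string, input: string): Promise<string>;
declare function executeTools(toolCallsJson: string): Promise<string>;

async function runAgent(bugReport: string): Promise<string> {
  const parsed  = await small("Extract repro steps and affected files as JSON.", bugReport);
  const plan    = await frontier("Plan a fix strategy as an ordered list of tool calls.", parsed);
  // The small model only fills in tool-call parameters; your runtime executes them.
  const calls   = await small("Emit the planned tool calls as JSON.", plan);
  const results = await executeTools(calls);
  const checked = await small("Validate: is this valid JSON with all required fields?", results);
  return frontier("Write the final summary for the user.", checked);
}
```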
3. The Router + Specialist
Router model (tiny, fast)
├─ Intent A → Specialist model A (fine-tuned for task A)
├─ Intent B → Specialist model B (fine-tuned for task B)
└─ Unknown → Frontier model (general fallback)
This is what production systems evolve toward. You fine-tune small models for your most common tasks and keep a frontier model as the general-purpose fallback. Companies like Vercel (with their AI SDK) and Anthropic (with their tool-use optimized endpoints) are building infrastructure for exactly this.
Best for: High-volume agents where you have 3-5 well-defined task types that make up 80% of traffic.
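A sketch of the dispatch logic, with a hypothetical route() helper standing in for the tiny router model and placeholder specialist model IDs:

```typescript
// Hypothetical stubs: route() is the tiny router model, callModel() hits any provider.
declare function route(input: string): Promise<string>;
declare function callModel(modelId: string, input: string): Promise<string>;

// Placeholder IDs for small models fine-tuned on your most common tasks.
const specialists: Record<string, string> = {
  bug_triage: "ft:your-small-model:bug-triage",
  invoice_extraction: "ft:your-small-model:invoices",
  commit_summary: "ft:your-small-model:commits",
};

async function dispatch(input: string): Promise<string> {
  const intent = await route(input);
  // Known intents go to their specialist; anything else falls back to the frontier.
  const modelId = specialists[intent] ?? "your-frontier-model";
  return callModel(modelId, input);
}
```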
Real Tools for Model Routing
You don't need to build routing from scratch. Here's what's available:
OpenRouter: Routes requests to multiple providers. Lets you set fallback chains (try Haiku first, fall back to Sonnet, then Opus). Pay-per-token across providers without managing API keys for each.
Vercel AI SDK: Has built-in model switching. You can define different models for different steps in a generateText or streamText call. Cleanest developer experience for the sandwich pattern (see the sketch after this list).
LiteLLM: Open-source proxy that normalizes 100+ provider APIs into one OpenAI-compatible interface. Add routing rules based on cost, latency, or model capability.
Ollama: For the router + specialist pattern, run small models locally (Qwen 3.5 1.5B, Gemma 4 4B) and call frontier models via API only when needed. Zero cost for 80% of your calls.
LangSmith: Not a router, but essential for routing. Track which model handled each step, what it cost, and whether it succeeded. Without observability, routing is just guessing.
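As a concrete example of the model switching the Vercel AI SDK entry describes, here is a minimal sandwich-style sketch with generateText; the Anthropic model IDs are placeholders to swap for whatever is current.

```typescript
import { generateText } from "ai";
import { anthropic } from "@ai-sdk/anthropic";

declare const bugReport: string; // assumed input from earlier in your agent

// Placeholder model IDs; substitute the frontier and small models you actually use.
const FRONTIER = anthropic("claude-opus-latest");
const SMALL = anthropic("claude-haiku-latest");

// Frontier model does the planning step...
const { text: plan } = await generateText({
  model: FRONTIER,
  prompt: `Plan a fix strategy for this bug report:\n${bugReport}`,
});

// ...a small model handles the structured execution step.
const { text: toolCalls } = await generateText({
  model: SMALL,
  prompt: `Emit the tool calls for this plan as JSON:\n${plan}`,
});
```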
The Local Option: Running Small Models on Developer Machines
The Needle project proved something important: a 26M-parameter model can run at 6,000 tok/s on a laptop. That's not a typo. For structured tasks, you don't even need a GPU.
Here's what local models can handle for agent workflows in 2026:
| Task | Model | Size | Speed | Cost |
|---|---|---|---|---|
| Function calling | Needle | 26M | 6K tok/s | Free |
| Intent classification | Qwen 3.5 Small | 1.5B | 200+ tok/s | Free |
| JSON extraction | Gemma 4 | 4B | 100+ tok/s | Free |
| Simple code completion | DeepSeek Coder V2 Lite | 2.4B | 80+ tok/s | Free |
| Tool result parsing | Phi-4 Mini | 3.8B | 100+ tok/s | Free |
All of these run on a MacBook Air. No GPU required.
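Calling one of these locally is a plain HTTP request. Here is a minimal sketch against Ollama's chat endpoint, with the model tag as a placeholder for whatever small model you have pulled:

```typescript
// Local intent classification via Ollama's chat API (default port 11434).
// The model tag is a placeholder for any small model you have pulled locally.
async function classifyLocally(userInput: string): Promise<string> {
  const res = await fetch("http://localhost:11434/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "your-small-model",
      stream: false,
      messages: [
        {
          role: "system",
          content: "Classify the request as bug_report, feature_request, or question. Reply with the label only.",
        },
        { role: "user", content: userInput },
      ],
    }),
  });
  const data = await res.json();
  return data.message.content.trim();
}
```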
The trade-off is clear: for $0, you get 90-95% reliability on structured tasks. For $0.14, you get 99%+ reliability using cloud small models. For $1.20, you get frontier reasoning. Pick based on your error budget.
When Routing Goes Wrong
Model routing isn't free. Here are the failure modes:
Misrouting complex tasks to small models. If your classifier sends a "simple" query to Haiku, but it's actually a multi-step reasoning task, you get wrong answers. The cost of a wrong answer (especially in coding or data pipelines) can exceed the savings.
Latency from the routing step itself. Adding a classification step before every call adds 50-200ms. If your agent is already fast (under 2 seconds), the overhead matters. Cache the router or use a hardcoded rule set for known patterns.
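One cheap way to skip that 50-200ms for traffic you already understand is a rule-based pre-router in front of the model classifier. A rough sketch with made-up patterns:

```typescript
// Zero-latency pre-router: known-simple patterns bypass the classifier entirely.
// The patterns and the length cutoff are illustrative; tune them to your traffic.
const simplePatterns = [
  /^list (my )?recent commits/i,
  /^show (the )?status of /i,
  /^what branch am i on\??$/i,
];

function preRoute(input: string): "small" | "frontier" | "classify" {
  if (simplePatterns.some((p) => p.test(input))) return "small";
  if (input.length > 2_000) return "frontier"; // long, context-heavy requests
  return "classify"; // everything else still goes through the model router
}
```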
Maintenance burden. Every new task type means updating your routing logic, possibly fine-tuning a new specialist model, and updating fallback chains. This is a real engineering cost that offsets API savings.
Testing sprawl. When you have 4 models handling different steps, you need integration tests for every combination. A change in Haiku's JSON formatting can break your tool-calling pipeline even though the "smart" part works fine.
The practical advice: start with two tiers (frontier + small). Only add more tiers when you have data showing that one tier is clearly mispriced for a specific task type.
How to Evaluate Your Current Agent
Run this quick audit:
1. Log every LLM call for a week. Include model, input tokens, output tokens, latency, and success/failure.
2. Classify each call as "needs reasoning" or "structured task." Be honest: if the model is just extracting JSON fields, it's structured.
3. Calculate the split. If more than 40% of your calls are structured tasks hitting a frontier model, you have a routing opportunity.
4. Test a small model on the structured calls. Take 100 real examples and run them through Haiku, GPT-4.1-mini, or a local model. Measure accuracy.
5. Calculate savings. If accuracy holds above your threshold, the savings are real and immediate.
For most coding agents, you'll find that 50-70% of calls are structured. The frontier model is being used as a very expensive regex engine.
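If you want step 3 to be a one-liner instead of a spreadsheet, here is a minimal sketch of the log record and the split calculation, assuming you persist one record per LLM call:

```typescript
// One record per LLM call; "structured" is your own step-2 judgment:
// the call only extracted, formatted, or validated rather than reasoned.
interface CallLog {
  model: string;
  inputTokens: number;
  outputTokens: number;
  latencyMs: number;
  success: boolean;
  structured: boolean;
}

// Fraction of frontier-model calls that were structured tasks (step 3).
function routingOpportunity(logs: CallLog[], frontierModels: string[]): number {
  const frontierCalls = logs.filter((l) => frontierModels.includes(l.model));
  if (frontierCalls.length === 0) return 0;
  return frontierCalls.filter((l) => l.structured).length / frontierCalls.length;
}

// Anything above ~0.4 signals a real routing opportunity.
```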
Bottom Line
The AI agent stack is maturing past "throw GPT-5 at everything." The economics don't work for high-volume production agents. Model routing, using the smallest model that gets the job done for each step, is becoming standard practice.
Start simple. Two tiers. Measure everything. Route based on data, not vibes.
You can compare model pricing and capabilities across providers on our benchmarks page, or explore the full tool catalog to find the right model provider for your agent stack. If you're evaluating agent frameworks, the agent frameworks comparison covers routing support in LangGraph, CrewAI, Mastra, and others.
The teams that figure out model routing now will have a compounding cost advantage as agent usage scales. It's not about being cheap; it's about spending your AI budget where it actually matters.