Engineering · May 5, 2026 · 5 min read

AI Agents in Production — What Actually Works After 6 Months

After running autonomous agents on real projects for 6 months: the patterns that survive contact with production, the ones that die in week one, and the guardrails that actually help.

NeuralStackly Engineering


Last Updated: May 2026

Six months ago we started running AI agents on production workloads, not demos or prototypes: real engineering tasks like code review, documentation updates, dependency updates, and customer support triage.

Here's what's still running and what we shut down, plus the patterns that actually hold up.

What We Ran

Four agents:

1. Code review agent — Reviews PRs for logic errors, security issues, and test coverage. Runs on every PR via GitHub Actions.

2. Documentation agent — Monitors code changes, flags missing docs, and drafts updates for review.

3. Dependency update agent — Weekly scan of outdated dependencies, drafts PRs with changelog summaries.

4. Customer support triage agent — Classifies ticket intent and drafts responses to support requests.

What Survived

Code Review Agent (Still Running)

The code review agent is the highest-ROI agent we've deployed. Every week it catches 3–5 real issues that would have shipped. It runs on every PR, posts findings as GitHub comments, and has a false positive rate of about 20% (acceptable — humans verify).

What made it work:

  • Narrow, well-defined scope (review logic + security, not style)
  • Sandboxed execution — it can read code but can't write to the repo without human approval
  • Grounding in actual project context via OpenClaw's codebase memory

Why it didn't die: The feedback loop is tight. Engineers see a comment, agree or disagree in 30 seconds, and the agent incorporates those corrections into future reviews, roughly as sketched below.
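A minimal sketch of what that correction loop can look like, assuming each finding gets an agree/disagree reaction that we log. The file name and function names here are illustrative, not our actual implementation:

```python
import json
from pathlib import Path

FEEDBACK_LOG = Path("review_feedback.jsonl")  # illustrative log location

def record_feedback(finding: dict, agreed: bool) -> None:
    """Append one engineer reaction (agree/disagree) to the log."""
    with FEEDBACK_LOG.open("a") as f:
        f.write(json.dumps({"finding": finding, "agreed": agreed}) + "\n")

def recent_corrections(n: int = 20) -> list[dict]:
    """Findings engineers disagreed with, replayed as negative
    few-shot examples in the next review prompt."""
    if not FEEDBACK_LOG.exists():
        return []
    entries = [json.loads(line) for line in FEEDBACK_LOG.read_text().splitlines()]
    return [e["finding"] for e in entries if not e["agreed"]][-n:]
```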

Dependency Update Agent (Still Running)

Weekly dependency scan. When it finds outdated packages, it:

1. Drafts a PR with the update

2. Runs the test suite

3. Posts a comment: "Updated X to Y. Test suite: PASSED / FAILED."

Takes 12 minutes per weekly run. A human doing this manually would take 45–60 minutes. That's 3 hours/month reclaimed. The agent makes one mistake per month (usually a minor version conflict it couldn't anticipate). Worth it.
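For concreteness, here is a minimal sketch of that weekly loop. The pip and pytest invocations are real; the PR-drafting and comment plumbing is elided, with printing standing in for the GitHub API calls our agent actually makes:

```python
import json
import subprocess

def scan_outdated() -> list[dict]:
    # `pip list --outdated --format=json` returns objects like
    # {"name": "requests", "version": "2.31.0", "latest_version": "2.32.3"}
    out = subprocess.run(
        ["pip", "list", "--outdated", "--format=json"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(out.stdout)

def run_tests() -> str:
    result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    return "PASSED" if result.returncode == 0 else "FAILED"

def weekly_run() -> None:
    for pkg in scan_outdated():
        name, latest = pkg["name"], pkg["latest_version"]
        # In production each update lands on its own branch as a draft PR;
        # here we just install in place and run the suite.
        subprocess.run(["pip", "install", f"{name}=={latest}"], check=True)
        print(f"Updated {name} to {latest}. Test suite: {run_tests()}.")

if __name__ == "__main__":
    weekly_run()
```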

What Died in Week One

Documentation Agent (Cancelled Week 2)

The agent was supposed to monitor code changes and draft documentation updates. It was too ambitious from the start:

  • It generated documentation that was technically accurate but tonally wrong (AI-sounding, not how our engineers write)
  • It had no sense of what mattered to document and what didn't (documented trivial getters, missed actual complexity)
  • The review overhead to fix its output was higher than just writing the docs ourselves

The lesson: Agents that need to exercise editorial judgment are not ready for production unsupervised. Agents that execute well-defined verification tasks? Working fine.

Customer Support Triage Agent (Cancelled Week 1)

We tried routing support tickets through an agent that would classify intent and draft responses. Three problems:

1. Latency: An 8-second response time on support tickets is unacceptable when customers expect near-instant replies.

2. Handoff friction: When the agent couldn't resolve something (which was often), the handoff to a human required the human to read the entire agent conversation — more work than just handling it from scratch.

3. Liability: Shipping AI-generated support responses without a human in the loop created compliance concerns we hadn't anticipated.

We pivoted to using the agent to suggest responses, not send them. A human reviews and sends. Better outcome, still saves time.

Patterns That Hold Up

Guardrails That Actually Work

1. Execution timeout (hard limit)

Every agent task gets a max duration. We use 5 minutes for simple tasks, 20 minutes for complex ones. After timeout, the agent's partial output is captured, a human is notified, and the task is marked incomplete.

Without this, agents can loop on ambiguous tasks indefinitely — burning compute and producing nothing.
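A sketch of that guardrail, assuming each agent task runs as a subprocess so the timeout can actually kill it. notify_human is a hypothetical stand-in for our Slack escalation:

```python
import subprocess

TIMEOUTS = {"simple": 5 * 60, "complex": 20 * 60}  # seconds

def notify_human(message: str) -> None:
    print(f"[escalation] {message}")  # stand-in for a Slack ping

def run_agent_task(cmd: list[str], kind: str = "simple") -> dict:
    timeout_s = TIMEOUTS[kind]
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
        return {"status": "complete", "output": proc.stdout}
    except subprocess.TimeoutExpired as exc:
        # Partial output up to the kill is still available on the exception.
        partial = exc.stdout or ""
        if isinstance(partial, bytes):
            partial = partial.decode(errors="replace")
        notify_human(f"Task {cmd} hit the {timeout_s}s timeout; marking incomplete.")
        return {"status": "incomplete", "output": partial}
```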

2. Output schema enforcement

When the agent produces structured output (e.g., a code review finding), we validate it against a JSON schema before accepting it. If validation fails, the agent gets one retry with the error message. Two failures → task marked as "needs human review."

This alone reduced our false positive rate by 60%.
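The validate-and-retry logic is a few lines with the jsonschema package. The schema below is a plausible shape for a review finding, not our exact one, and call_agent is whatever function hits your model:

```python
import json
from jsonschema import validate, ValidationError

FINDING_SCHEMA = {
    "type": "object",
    "required": ["file", "line", "severity", "message"],
    "properties": {
        "file": {"type": "string"},
        "line": {"type": "integer", "minimum": 1},
        "severity": {"enum": ["info", "warning", "critical"]},
        "message": {"type": "string"},
    },
}

def get_validated_finding(call_agent, prompt: str) -> dict | None:
    for attempt in range(2):  # initial attempt plus one retry
        raw = call_agent(prompt)
        try:
            finding = json.loads(raw)
            validate(instance=finding, schema=FINDING_SCHEMA)
            return finding
        except (json.JSONDecodeError, ValidationError) as err:
            # Feed the error back so the retry can self-correct.
            prompt = f"{prompt}\n\nYour last output was invalid: {err}. Try again."
    return None  # two failures: mark the task "needs human review"
```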

3. Human-in-the-loop for writes

Agents can read everything. For writing actions (PR creation, comment posting, API calls that modify state), we require explicit human approval via GitHub Actions.

The exception: dependency updates. We've now run 52 dependency update PRs without a human approving each one. After 3 months with zero incidents, we graduated it to fully automated.
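In-process, the gate can be as simple as the sketch below. In our setup the actual approval happens through GitHub Actions (environment protection rules with required reviewers); this only illustrates the policy: reads are free, writes queue, and graduated action types skip the queue. All names are illustrative:

```python
from dataclasses import dataclass, field
from typing import Any, Callable

AUTO_APPROVED = {"dependency_update"}  # graduated after 3 incident-free months

@dataclass
class WriteGate:
    pending: list[tuple[str, dict, Callable[[dict], Any]]] = field(default_factory=list)

    def request(self, action: str, payload: dict, execute: Callable[[dict], Any]) -> str:
        """Agents call this for any state-modifying action."""
        if action in AUTO_APPROVED:
            execute(payload)
            return "executed"
        self.pending.append((action, payload, execute))
        return "queued for human approval"

    def approve(self, index: int) -> None:
        """Called from the human review UI."""
        _, payload, execute = self.pending.pop(index)
        execute(payload)
```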

4. Sandboxed network access

Our agents run with an allowlist of IP ranges for external calls. They can call our internal APIs and approved external services (GitHub, Slack). They cannot call arbitrary external endpoints without explicit allowlist addition.

This isn't paranoia — we've had one agent attempt to call an analytics endpoint it discovered in the codebase during a code review. The sandbox blocked it.
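The policy check itself is simple. A sketch, assuming CIDR allowlist entries; the ranges below are placeholders, and real enforcement should live at the network layer (egress firewall rules), not in application code:

```python
import ipaddress
import socket

ALLOWED_RANGES = [
    ipaddress.ip_network("140.82.112.0/20"),  # placeholder: GitHub
    ipaddress.ip_network("10.0.0.0/8"),       # placeholder: internal APIs
]

def egress_allowed(host: str) -> bool:
    """Resolve the host and check it against the CIDR allowlist."""
    addr = ipaddress.ip_address(socket.gethostbyname(host))
    return any(addr in net for net in ALLOWED_RANGES)

# The analytics endpoint from the incident above would fail this check
# unless someone explicitly added its range to ALLOWED_RANGES.
```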

The Honest Assessment

AI agents in production are real and valuable today — but only for narrowly scoped, well-measured tasks. The gap between "agent that sounds impressive in a demo" and "agent that reliably ships value without babysitting" is still wide.

The teams winning with agents aren't building one super-agent. They're building 5–10 narrow agents with tight scope, clear success metrics, and human oversight that actually gets removed once the agent proves itself.

That's not less impressive than the super-agent vision. It's just more honest.
