AI Agents in Production — What Actually Works After 6 Months
After running autonomous agents on real projects for 6 months: the patterns that survive contact with production, the ones that die in week one, and the guardrails that actually help.
Last Updated: May 2026
Six months ago we started running AI agents on production workloads — not demos, not prototypes. Real engineering tasks: code reviews, test generation, documentation updates, dependency updates, and customer support triage.
Here's what's still running and what we shut down, plus the patterns that actually hold up.
What We Ran
Three types of agents:
1. Code review agent — Reviews PRs for logic errors, security issues, and test coverage. Runs on every PR via GitHub Actions.
2. Documentation agent — Monitors code changes, flags missing docs, and drafts updates for review.
3. Dependency update agent — Weekly scan of outdated dependencies, drafts PRs with changelog summaries.
What Survived
Code Review Agent (Still Running)
The code review agent is the highest-ROI agent we've deployed. Every week it catches 3–5 real issues that would have shipped. It runs on every PR, posts findings as GitHub comments, and has a false positive rate of about 20% (acceptable — humans verify).
What made it work:
- Narrow, well-defined scope (review logic + security, not style)
- Sandboxed execution — it can read code but can't write to the repo without human approval
- Grounding in actual project context via OpenClaw's codebase memory
Why it didn't die: The feedback loop is tight. Engineers see a comment, they agree or disagree in 30 seconds, the agent learns from corrections via human feedback.
Dependency Update Agent (Still Running)
Weekly dependency scan. When it finds outdated packages, it:
1. Drafts a PR with the update
2. Runs the test suite
3. Posts a comment: "Updated X to Y. Test suite: PASSED / FAILED."
Takes 12 minutes per weekly run. A human doing this manually would take 45–60 minutes. That's 3 hours/month reclaimed. The agent makes one mistake per month (usually a minor version conflict it couldn't anticipate). Worth it.
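The three steps above can be sketched in a few lines. This is a minimal sketch under assumptions: `run_test_suite` and `format_update_comment` are hypothetical helpers (and `pytest -q` stands in for whatever test command a project actually uses), not the agent's real API.

```python
import subprocess

def run_test_suite() -> bool:
    """Run the project's tests; the exact command is an assumption."""
    result = subprocess.run(["pytest", "-q"], capture_output=True)
    return result.returncode == 0

def format_update_comment(package: str, new_version: str, passed: bool) -> str:
    """Build the status comment the agent posts on its draft PR."""
    status = "PASSED" if passed else "FAILED"
    return f"Updated {package} to {new_version}. Test suite: {status}."
```

The key design choice is that the comment always reports a definite PASSED/FAILED verdict, so a reviewer can triage the PR from the notification alone.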
What Died in Week One
Documentation Agent (Cancelled Week 2)
The agent was supposed to monitor code changes and draft documentation updates. It was too ambitious from the start:
- It generated documentation that was technically accurate but tonally wrong (AI-sounding, not like how our engineers write)
- It had no sense of what mattered to document and what didn't (documented trivial getters, missed actual complexity)
- The review overhead to fix its output was higher than just writing the docs ourselves
The lesson: Agents that need to exercise editorial judgment are not ready for production unsupervised. Agents that execute well-defined verification tasks? Working fine.
Customer Support Triage Agent (Cancelled Week 1)
We tried routing support tickets through an agent that would classify intent and draft responses. Three problems:
1. Latency: 8 seconds per agent response is unacceptable for support, where customers expect a reply within 30 seconds and triage takes multiple turns.
2. Handoff friction: When the agent couldn't resolve something (which was often), the handoff to a human required the human to read the entire agent conversation — more work than just handling it from scratch.
3. Liability: Shipping AI-generated support responses without a human in the loop created compliance concerns we hadn't anticipated.
We pivoted to using the agent to suggest responses, not send them. A human reviews and sends. Better outcome, still saves time.
Patterns That Hold Up
Guardrails That Actually Work
1. Execution timeout (hard limit)
Every agent task gets a max duration. We use 5 minutes for simple tasks, 20 minutes for complex ones. After timeout, the agent's partial output is captured, a human is notified, and the task is marked incomplete.
Without this, agents can loop on ambiguous tasks indefinitely — burning compute and producing nothing.
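The hard limit can be sketched with Python's standard `concurrent.futures`. This is illustrative only: our actual runner kills the sandboxed process outright, whereas a thread, as noted in the comment, can't be forcibly stopped.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def run_with_timeout(task, timeout_s: float):
    """Run an agent task under a hard wall-clock limit.

    Returns ("ok", result) on success, or ("incomplete", None) when the
    limit is hit -- at which point the caller notifies a human.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(task)
    try:
        return "ok", future.result(timeout=timeout_s)
    except FutureTimeout:
        return "incomplete", None
    finally:
        # Don't block waiting for a stuck worker; a real deployment
        # kills the sandbox process instead, since threads can't be killed.
        pool.shutdown(wait=False)
```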
2. Output schema enforcement
When the agent produces structured output (e.g., a code review finding), we validate it against a JSON schema before accepting it. If validation fails, the agent gets one retry with the error message. Two failures → task marked as "needs human review."
This alone reduced our false positive rate by 60%.
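The validate-then-retry loop looks roughly like this. A simplified sketch: the hand-rolled required-field check stands in for a real JSON Schema validator, and the field names are illustrative, not our actual finding schema.

```python
REQUIRED = {"file": str, "line": int, "severity": str, "message": str}

def validate_finding(finding: dict):
    """Return (ok, error_message) for one code-review finding."""
    for key, typ in REQUIRED.items():
        if key not in finding:
            return False, f"missing field: {key}"
        if not isinstance(finding[key], typ):
            return False, f"field {key!r} must be {typ.__name__}"
    return True, ""

def accept_with_retry(generate):
    """One retry with the error fed back; two failures return None,
    which marks the task as "needs human review"."""
    finding = generate(None)
    ok, err = validate_finding(finding)
    if ok:
        return finding
    finding = generate(err)  # retry, passing the validation error back
    ok, _ = validate_finding(finding)
    return finding if ok else None
```

Feeding the validation error back on the retry matters: the agent gets a concrete reason its output was rejected rather than guessing.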
3. Human-in-the-loop for writes
Agents can read everything. For writing actions (PR creation, comment posting, API calls that modify state), we require explicit human approval via GitHub Actions.
The exception: dependency updates. After 3 months with zero incidents, we graduated that agent to fully automated; it has now opened 52 update PRs without a human approving each one.
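The gate itself reduces to a few lines. A sketch under assumptions: the action names and the graduated set are made up for illustration, and "pending approval" is a stub for the real GitHub Actions approval step.

```python
# Actions that have earned full automation (illustrative name).
GRADUATED = {"dependency_update_pr"}

def execute_write(action: str, approved: bool = False) -> str:
    """Reads are always allowed; writes need explicit human approval
    unless the action type has graduated to full automation."""
    if action in GRADUATED or approved:
        return "executed"
    return "pending_approval"  # surfaced to a human for sign-off
```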
4. Sandboxed network access
Our agents run with an allowlist of IP ranges for external calls. They can call our internal APIs and approved external services (GitHub, Slack). They cannot call arbitrary external endpoints without explicit allowlist addition.
This isn't paranoia — we've had one agent attempt to call an analytics endpoint it discovered in the codebase during a code review. The sandbox blocked it.
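The egress check is conceptually simple. A minimal sketch: a hostname allowlist stands in for the real IP-range allowlist, and the hosts listed are examples, not our actual configuration.

```python
from urllib.parse import urlparse

# Illustrative allowlist; production uses IP ranges per environment.
ALLOWED_HOSTS = {"api.github.com", "slack.com", "internal.example.com"}

def check_egress(url: str) -> bool:
    """Gate every outbound call; anything off the allowlist is
    blocked (and logged for review)."""
    host = urlparse(url).hostname or ""
    return host in ALLOWED_HOSTS
```

Default-deny is the point: the analytics endpoint incident was caught not because anyone predicted it, but because nothing is callable until someone adds it.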
The Honest Assessment
AI agents in production are real and valuable today — but only for narrowly scoped, well-measured tasks. The gap between "agent that sounds impressive in a demo" and "agent that reliably ships value without babysitting" is still wide.
The teams winning with agents aren't building one super-agent. They're building 5–10 narrow agents with tight scope, clear success metrics, and human oversight that actually gets removed once the agent proves itself.
That's not less impressive than the super-agent vision. It's just more honest.