Ariso

Ivan's Inferences #4

Running Deep Agents in Production: Lessons from the Trenches

8 min read
AI · Deep Agents · Engineering · Production

I've spent the past few months running deepagents in production, and I'll be honest—it was more expensive and complex than I expected. If you haven't heard of it, deepagents draws heavy inspiration from Claude Code—Anthropic's popular CLI tool that's proven remarkably effective at complex, multi-step tasks. The good news? If you're just getting started with v1, you can skip most of my painful learning curve. Here's what I wish I knew from day one.

Why Deep Agents Over Supervisors?

If you've built multi-agent systems before, you've probably tried the supervisor pattern—one LLM routing tasks to specialized agents. It works, but it has problems:

  • Shallow reasoning: Supervisors make surface-level routing decisions without deep task understanding
  • Context fragmentation: Each agent call starts fresh, losing continuity
  • Manual orchestration: You're writing the coordination logic yourself

deepagents takes a different approach. It uses a hierarchical planner that breaks down complex tasks, maintains context across subagent calls via a filesystem backend, and handles the orchestration automatically. It also includes a built-in todo tool (inspired by Claude Code) that lets the agent track its own progress through multi-step tasks—something supervisors can't do well. The result is more coherent outputs with less boilerplate code.

I started with a supervisor pattern, spent weeks debugging coordination issues, then switched to deepagents. The upgrade to v1 was painful (breaking changes), but the architecture is fundamentally better. If you're starting fresh, just use v1 and skip the pain.

The Expensive Reality Check

Deep agents aren't like simple chat applications. A single task might:

  • Make 10-50+ LLM calls across orchestrator and subagents
  • Take 30 seconds to several minutes
  • Cost $1-10+ depending on complexity

My first production bill was... eye-opening. But after a lot of iteration, I've found patterns that work.

1. Don't Block Your API—Use Workers

This was my first mistake. I invoked the agent directly in my API handler. Timeouts everywhere.

The fix is simple: use a message queue (Redis Streams, SQS, whatever you have) and dedicated worker processes.

API Request → Queue → Worker Process
     ↓                      ↓
"Task queued"         Agent executes

Your API returns immediately, workers pick up tasks and run them in the background. Scale workers horizontally as needed. Handle poison messages by acknowledging failures to prevent infinite retries.
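Here's a minimal sketch of the pattern, using an in-memory queue purely for illustration (swap in Redis Streams, SQS, or whatever you have); failed tasks are retried a few times and then dead-lettered so a poison message can't loop forever:

```typescript
// Minimal queue/worker sketch. In production, replace the in-memory
// arrays with a real broker (Redis Streams, SQS, etc.).
type Task = { id: string; payload: string; attempts: number };

const MAX_ATTEMPTS = 3;
const queue: Task[] = [];
const deadLetter: Task[] = [];
const results = new Map<string, string>();

// API handler: enqueue and return immediately.
function enqueue(id: string, payload: string): string {
  queue.push({ id, payload, attempts: 0 });
  return "Task queued"; // the HTTP response; the agent runs later
}

// Worker loop: pull tasks, run the expensive agent work, retry on failure.
async function runWorker(
  runAgent: (payload: string) => Promise<string>
): Promise<void> {
  while (queue.length > 0) {
    const task = queue.shift()!;
    try {
      results.set(task.id, await runAgent(task.payload));
    } catch {
      task.attempts += 1;
      if (task.attempts >= MAX_ATTEMPTS) {
        deadLetter.push(task); // acknowledge the failure: no infinite retries
      } else {
        queue.push(task); // requeue for another attempt
      }
    }
  }
}
```

Scaling out is then just running more worker processes against the same queue.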

2. Control Your Costs (Before They Control You)

Here's the insight that saved me the most money: not all LLM calls need reasoning models.

The orchestrator just routes tasks and tracks state—that's pattern matching, not deep thinking. Use a cheap model (gpt-4o-mini) for orchestration and save the reasoning models (o3-mini) for subagents that do the actual work.

Role                       Recommended Model   Cost / 1M tokens
Orchestrator               gpt-4o-mini         ~$0.15
Worker / Critique Agents   o3-mini             ~$1.10

Use feature flags to control model selection per user or organization. This lets you A/B test, roll out gradually, and quickly rollback if costs spike.
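A sketch of what per-role selection behind a flag can look like; the flag shape and helper are illustrative, not tied to any particular flag service:

```typescript
// Per-org, per-role model selection behind a feature flag.
type Role = "orchestrator" | "worker";

interface OrgFlags {
  // e.g. { workerModel: "gpt-4o" } to override for a single org
  orchestratorModel?: string;
  workerModel?: string;
}

const DEFAULTS: Record<Role, string> = {
  orchestrator: "gpt-4o-mini", // cheap routing and state tracking
  worker: "o3-mini",           // reasoning where the real work happens
};

function pickModel(role: Role, flags: OrgFlags = {}): string {
  const override =
    role === "orchestrator" ? flags.orchestratorModel : flags.workerModel;
  return override ?? DEFAULTS[role];
}
```

An A/B test is then just setting the override for a slice of orgs, and a rollback is clearing the flag.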

Other cost controls that helped:

  • Set max_tokens limits on every call
  • Cache web search and knowledge base results
  • Skip re-processing if you already have high-confidence results
  • Add output length constraints to your prompts
  • Rate limit per user/org

Use the Filesystem Backend

This one's important: deepagents v1 supports a filesystem backend that writes intermediate results to files instead of keeping everything in context. This dramatically reduces token usage on complex tasks. If you're generating substantial intermediate artifacts, this alone can cut costs significantly.
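deepagents wires this up for you, but the underlying idea is easy to see in isolation: a tool writes its bulky output to disk and hands back a short reference, so the full payload never enters the context window. A generic illustration (not the deepagents API):

```typescript
import { mkdtempSync, writeFileSync, readFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Scratch directory for this task's intermediate artifacts.
const workDir = mkdtempSync(join(tmpdir(), "agent-"));

// Write the full result to disk; return only a short file path.
// The agent's context only ever sees this reference, not the payload.
function saveArtifact(name: string, content: string): string {
  const path = join(workDir, name);
  writeFileSync(path, content, "utf8");
  return path;
}

// A later step reads the artifact back only when it actually needs it.
function loadArtifact(path: string): string {
  return readFileSync(path, "utf8");
}
```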

3. Web Search Isn't Enough

For enterprise use cases, public web search misses too much:

  • Internal knowledge and policies
  • Who knows what (relationship context)
  • Meeting discussions and decisions

Build a unified tool that queries multiple sources in parallel—documents, chat history, people directory, calendar, etc. Use Promise.allSettled so one failing source doesn't take down the whole query, and add a per-source timeout so one slow source doesn't stall the rest.
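A minimal sketch of such a tool, with a per-source timeout and failure tolerance; the source names are illustrative:

```typescript
interface SourceResult {
  source: string;
  items: string[];
}

// Reject a promise that takes longer than `ms` to settle.
function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return Promise.race([
    p,
    new Promise<T>((_, reject) =>
      setTimeout(() => reject(new Error("source timed out")), ms)
    ),
  ]);
}

// Fan out to every source in parallel; keep whatever comes back.
async function unifiedSearch(
  query: string,
  sources: Record<string, (q: string) => Promise<string[]>>,
  timeoutMs = 2000
): Promise<SourceResult[]> {
  const names = Object.keys(sources);
  const settled = await Promise.allSettled(
    names.map((name) => withTimeout(sources[name](query), timeoutMs))
  );
  // A rejected or timed-out source simply drops out of the results.
  return settled.flatMap((res, i) =>
    res.status === "fulfilled" ? [{ source: names[i], items: res.value }] : []
  );
}
```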

Quick Wins

Architecture:

  • Separate API from workers—never run expensive agent work in request handlers
  • Add circuit breakers for flaky tools/APIs
  • Use structured outputs for reliability
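The circuit-breaker bullet deserves a concrete shape. A minimal sketch, with illustrative defaults: after a run of consecutive failures the circuit opens and calls fail fast until a cooldown elapses, instead of hammering a dead API on every agent step:

```typescript
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(
    private threshold = 3,       // consecutive failures before opening
    private cooldownMs = 30_000, // how long to fail fast
    private now = () => Date.now()
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    const open =
      this.failures >= this.threshold &&
      this.now() - this.openedAt < this.cooldownMs;
    if (open) throw new Error("circuit open: skipping call");
    try {
      const result = await fn();
      this.failures = 0; // any success closes the circuit
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.threshold) this.openedAt = this.now();
      throw err;
    }
  }
}
```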

User Experience:

  • Send progress updates during long tasks ("Gathering context...", "Processing...")
  • Save intermediate state so crashed tasks can resume
  • Monitor your queue depth—if it's growing, you need more workers

The Bottom Line

Running deep agents in production is more like running a data pipeline than a chat app. You need:

  1. Async processing: Workers, queues, background execution
  2. Cost controls: Tiered models, caching, rate limits, filesystem backend
  3. Rich context: Custom tools beyond web search
  4. Observability: Track tokens, costs, and latency per call
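For the observability point, the cheapest thing that works is a per-call ledger: record tokens, latency, and an estimated cost for every LLM call so spikes show up per task rather than on the monthly bill. A sketch, with illustrative per-1M-token prices (look up current pricing for your models):

```typescript
interface CallRecord {
  model: string;
  inputTokens: number;
  outputTokens: number;
  latencyMs: number;
  costUsd: number;
}

// Illustrative prices per 1M tokens; verify against current pricing.
const PRICE_PER_M: Record<string, { in: number; out: number }> = {
  "gpt-4o-mini": { in: 0.15, out: 0.6 },
  "o3-mini": { in: 1.1, out: 4.4 },
};

class UsageLedger {
  readonly calls: CallRecord[] = [];

  record(
    model: string,
    inputTokens: number,
    outputTokens: number,
    latencyMs: number
  ): void {
    const price = PRICE_PER_M[model] ?? { in: 0, out: 0 };
    const costUsd =
      (inputTokens * price.in + outputTokens * price.out) / 1_000_000;
    this.calls.push({ model, inputTokens, outputTokens, latencyMs, costUsd });
  }

  totalCostUsd(): number {
    return this.calls.reduce((sum, c) => sum + c.costUsd, 0);
  }
}
```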

The deepagents v1 library handles a lot of the complexity I had to build myself. If you're starting fresh, you're in a much better position than I was. Just remember: treat your deep agent like the expensive, powerful tool it is. Give it proper infrastructure from the start, and you'll avoid the painful lessons I learned the hard way.


Shawn Zhu is the founding engineer and head of AIOps at Ariso. He previously worked at IBM, LifeOmic, and FountainLife, and co-created the Zori AI medical assistant.

Who's Ivan?

Who's Ivan, you ask? He writes half our code, and he's free to use. Check him out on npm or GitHub.