Ivan's Inferences #4
Running Deep-Agent in Production: Lessons from the Trenches
I've spent the past few months running deepagents in production, and I'll be honest—it was more expensive and complex than I expected. If you haven't heard of it, deepagents draws heavy inspiration from Claude Code—Anthropic's popular CLI tool that's proven remarkably effective at complex, multi-step tasks. The good news? If you're just getting started with v1, you can skip most of my painful learning curve. Here's what I wish I knew from day one.
Why Deep-Agent Over Supervisors?
If you've built multi-agent systems before, you've probably tried the supervisor pattern—one LLM routing tasks to specialized agents. It works, but it has problems:
- Shallow reasoning: Supervisors make surface-level routing decisions without deep task understanding
- Context fragmentation: Each agent call starts fresh, losing continuity
- Manual orchestration: You're writing the coordination logic yourself
deepagents takes a different approach. It uses a hierarchical planner that breaks down complex tasks, maintains context across subagent calls via a filesystem backend, and handles the orchestration automatically. It also includes a built-in todo tool (inspired by Claude Code) that lets the agent track its own progress through multi-step tasks—something supervisors can't do well. The result is more coherent outputs with less boilerplate code.
I started with a supervisor pattern, spent weeks debugging coordination issues, then switched to deepagents. The upgrade to v1 was painful (breaking changes), but the architecture is fundamentally better. If you're starting fresh, just use v1 and skip the pain.
The Expensive Reality Check
Deep agents aren't like simple chat applications. A single task might:
- Make 10-50+ LLM calls across orchestrator and subagents
- Take 30 seconds to several minutes
- Cost $1-10+ depending on complexity
My first production bill was... eye-opening. But after a lot of iteration, I've found patterns that work.
1. Don't Block Your API—Use Workers
This was my first mistake. I invoked the agent directly in my API handler. Timeouts everywhere.
The fix is simple: use a message queue (Redis Streams, SQS, whatever you have) and dedicated worker processes.
API Request → Queue → Worker Process
     ↓                      ↓
"Task queued"         Agent executes

Your API returns immediately; workers pick up tasks and run them in the background. Scale workers horizontally as needed. Handle poison messages by acknowledging failures after a retry limit to prevent infinite retries.
2. Control Your Costs (Before They Control You)
Here's the insight that saved me the most money: not all LLM calls need reasoning models.
The orchestrator just routes tasks and tracks state—that's pattern matching, not deep thinking. Use a cheap model (gpt-4o-mini) for orchestration and save the reasoning models (o3-mini) for subagents that do the actual work.
| Role | Recommended Model | Cost/1M tokens |
|---|---|---|
| Orchestrator | gpt-4o-mini | ~$0.15 |
| Worker/Critique Agents | o3-mini | ~$1.10 |
Use feature flags to control model selection per user or organization. This lets you A/B test, roll out gradually, and roll back quickly if costs spike.
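Tiered selection plus per-org flags can be as small as this sketch. The model names match the table above; the override map is a stand-in for whatever feature-flag service you use:

```typescript
// Tiered model selection with per-organization feature-flag overrides.
type Role = "orchestrator" | "worker";

// Defaults: cheap model for routing/state-tracking, reasoning model for real work.
const defaults: Record<Role, string> = {
  orchestrator: "gpt-4o-mini",
  worker: "o3-mini",
};

// Per-org overrides, e.g. for A/B tests or an emergency all-cheap rollback.
const flagOverrides = new Map<string, Partial<Record<Role, string>>>();

function pickModel(org: string, role: Role): string {
  return flagOverrides.get(org)?.[role] ?? defaults[role];
}
```

If costs spike for one customer, flipping their flag (`flagOverrides.set("acme", { worker: "gpt-4o-mini" })`) downgrades only their workers without a deploy.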
Other cost controls that helped:
- Set max_tokens limits on every call
- Cache web search and knowledge base results
- Skip re-processing if you already have high-confidence results
- Add output length constraints to your prompts
- Rate limit per user/org
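For the caching bullet, a small TTL cache in front of web search and knowledge-base lookups goes a long way. A minimal sketch (the injectable clock is just there to make expiry testable):

```typescript
// Minimal TTL cache for search / knowledge-base results.
type Entry<V> = { value: V; expiresAt: number };

class TtlCache<V> {
  private store = new Map<string, Entry<V>>();
  constructor(private ttlMs: number, private now: () => number = Date.now) {}

  get(key: string): V | undefined {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.store.delete(key); // expired: treat as a miss
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.store.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}
```

Check the cache before every search call; identical queries inside one multi-step task are common, so even a short TTL saves real money.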
Use the Filesystem Backend
This one's important: deepagents v1 supports a filesystem backend that writes intermediate results to files instead of keeping everything in context. This dramatically reduces token usage on complex tasks. If you're generating substantial intermediate artifacts, this alone can cut costs significantly.
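To illustrate the idea (this is the concept, not the deepagents API itself): instead of carrying a large intermediate artifact in the context window, persist it to disk and hand the model only a short reference. `offloadArtifact` is a hypothetical helper name:

```typescript
// Offload a large intermediate artifact to a file; keep only a small
// reference string in the agent's context.
import { writeFileSync, readFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

function offloadArtifact(
  name: string,
  content: string,
): { path: string; note: string } {
  const path = join(tmpdir(), name);
  writeFileSync(path, content, "utf8");
  // Only this short note goes back into the context window; a later step
  // reads the file from `path` when it actually needs the full text.
  return { path, note: `artifact:${name} (${content.length} chars)` };
}
```

A 50,000-character research draft becomes a ~30-character reference in context; subagents that need it read the file instead of re-consuming the tokens.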
3. Web Search Isn't Enough
For enterprise use cases, public web search misses too much:
- Internal knowledge and policies
- Who knows what (relationship context)
- Meeting discussions and decisions
Build a unified tool that queries multiple sources in parallel—documents, chat history, people directory, calendar, etc. Use Promise.allSettled so one slow or failing source doesn't block the others.
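A sketch of that unified tool; the source names are illustrative:

```typescript
// Fan out one query to several context sources in parallel; keep whatever
// succeeds, drop whatever fails, block on nothing.
type Source = { name: string; query: (q: string) => Promise<string[]> };

async function unifiedSearch(sources: Source[], q: string): Promise<string[]> {
  const settled = await Promise.allSettled(sources.map((s) => s.query(q)));
  // A rejected source yields no results instead of failing the whole search.
  return settled.flatMap((r) => (r.status === "fulfilled" ? r.value : []));
}
```

Wrap each source in its own timeout before it reaches `unifiedSearch`, so "slow" degrades into "rejected" rather than stalling the agent.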
Quick Wins
Architecture:
- Separate API from workers—never run expensive agent work in request handlers
- Add circuit breakers for flaky tools/APIs
- Use structured outputs for reliability
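The circuit-breaker bullet can be sketched in a few lines. After `threshold` consecutive failures the breaker opens and fails fast until `cooldownMs` passes; a later success closes it again (the injectable clock is only for testability):

```typescript
// Minimal circuit breaker for a flaky tool or API.
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  constructor(
    private threshold = 3,
    private cooldownMs = 30_000,
    private now: () => number = Date.now,
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (
      this.failures >= this.threshold &&
      this.now() - this.openedAt < this.cooldownMs
    ) {
      throw new Error("circuit open"); // fail fast; don't hit the flaky API
    }
    try {
      const result = await fn();
      this.failures = 0; // success closes the breaker
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.threshold) this.openedAt = this.now();
      throw err;
    }
  }
}
```

Failing fast matters doubly for deep agents: an open breaker saves you both the latency and the LLM calls the agent would burn retrying a dead tool.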
User Experience:
- Send progress updates during long tasks ("Gathering context...", "Processing...")
- Save intermediate state so crashed tasks can resume
- Monitor your queue depth—if it's growing, you need more workers
The Bottom Line
Running deep agents in production is more like running a data pipeline than a chat app. You need:
- Async processing: Workers, queues, background execution
- Cost controls: Tiered models, caching, rate limits, filesystem backend
- Rich context: Custom tools beyond web search
- Observability: Track tokens, costs, and latency per call
The deepagents v1 library handles a lot of the complexity I had to build myself. If you're starting fresh, you're in a much better position than I was. Just remember: treat your deep agent like the expensive, powerful tool it is. Give it proper infrastructure from the start, and you'll avoid the painful lessons I learned the hard way.
