A practical guide to multi-agent AI architecture, drawn from building a production system.
Everyone’s talking about AI agents. Most tutorials show you a single agent calling a few tools. But real-world problems — the kind businesses pay to solve — require multiple agents working together. And that’s where things get hard.
I’ve spent the past several months building CMUX, a multi-agent orchestration system where AI agents coordinate to complete complex software engineering tasks autonomously. Agents write code, review each other’s work, run tests, and recover from failures — all without human intervention.
This post covers what I’ve learned about making multi-agent AI systems that don’t just demo well, but actually work in production.
A single LLM agent hits a wall fast. Context windows fill up. Long-running tasks lose coherence. One bad tool call derails the entire chain of thought.
Multi-agent systems solve this by decomposing work across agents, each with its own context and a narrower job.
The trade-off is coordination complexity. You need infrastructure to manage agent lifecycles, route messages, handle failures, and maintain shared state. That infrastructure is what separates a demo from a system.
The most effective pattern I’ve found is a supervisor-worker hierarchy. It mirrors how engineering teams actually work:
              User Request
                   ↓
Supervisor Agent (plans, delegates, reviews)
                   ↓
┌────────────┬────────────┬────────────┐
│ Worker 1   │ Worker 2   │ Worker 3   │
│ (Backend)  │ (Frontend) │ (Tests)    │
└────────────┴────────────┴────────────┘
The supervisor plans the work, delegates it to workers, and reviews the results.
Workers are disposable. They spin up, complete a focused task, and shut down. The supervisor persists and maintains the big picture.
Here’s a simplified version of how CMUX spawns a worker:
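import shlex
import subprocess

async def spawn_worker(agent_id: str, task: str, project_path: str):
    """Spawn a worker agent in an isolated tmux window."""
    session = "cmux"
    # Create an isolated window for the worker.
    # (subprocess.run blocks, but both tmux commands return almost immediately.)
    subprocess.run(
        ["tmux", "new-window", "-t", session, "-n", agent_id],
        check=True,  # fail loudly if the session does not exist
    )
    # Write the task context and start the agent.
    # write_worker_context is a CMUX helper that is not shown here.
    context_file = write_worker_context(agent_id, task, project_path)
    subprocess.run(
        [
            "tmux", "send-keys", "-t", f"{session}:{agent_id}",
            # shlex.quote keeps paths with spaces or quotes intact in the shell
            f"claude --resume {shlex.quote(str(context_file))}", "Enter",
        ],
        check=True,
    )
    return {"agent_id": agent_id, "status": "IN_PROGRESS"}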
Each worker gets its own tmux window — a real terminal with full shell access. This is important. Agents need to run commands, edit files, and see output. Sandboxing them in a Python subprocess limits what they can do.
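To make the lifecycle side concrete, here is a minimal sketch of how a supervisor could track the workers it has spawned. It is not CMUX's actual bookkeeping: `WorkerRegistry`, `Status`, and the `DONE` and `BLOCKED` states are hypothetical names, chosen to line up with the `IN_PROGRESS` status returned by `spawn_worker`.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(str, Enum):
    # IN_PROGRESS comes from spawn_worker; DONE and BLOCKED are assumed states.
    IN_PROGRESS = "IN_PROGRESS"
    DONE = "DONE"
    BLOCKED = "BLOCKED"


@dataclass
class WorkerRegistry:
    """Hypothetical in-memory record of which workers exist and their state."""

    workers: dict[str, Status] = field(default_factory=dict)

    def register(self, agent_id: str) -> None:
        # Refuse duplicates so two workers never share a tmux window name.
        if agent_id in self.workers:
            raise ValueError(f"worker {agent_id!r} already registered")
        self.workers[agent_id] = Status.IN_PROGRESS

    def update(self, agent_id: str, status: Status) -> None:
        if agent_id not in self.workers:
            raise KeyError(agent_id)
        self.workers[agent_id] = status

    def retire(self, agent_id: str) -> None:
        # Workers are disposable: forget them once their window is closed.
        self.workers.pop(agent_id, None)

    def active(self) -> list[str]:
        # Sorted so the supervisor sees a stable order between calls.
        return sorted(a for a, s in self.workers.items() if s is Status.IN_PROGRESS)
```

The point of keeping this in the supervisor, rather than in the workers, is that a worker can die at any moment without taking the record of its task with it.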
Agents need to communicate, but direct agent-to-agent communication creates a tangled mess. Instead, use a centralized message bus.
CMUX uses a file-based mailbox — dead simple and surprisingly robust:
# Worker reports completion
./tools/mailbox done "Implemented user authentication endpoint. Tests passing."
# Worker reports a blocker
./tools/mailbox blocked "Database schema migration requires supervisor approval."
# Supervisor sends task to worker
./tools/mailbox send worker-backend "Add rate limiting to /api/auth endpoint"
A router daemon polls the mailbox and delivers messages to the right agent:
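#!/bin/bash
# Simplified message router
while true; do
    if [ -s "$MAILBOX_FILE" ]; then
        # Move the mailbox aside before reading it, so messages appended
        # during delivery are not wiped when the processed batch is cleared.
        mv "$MAILBOX_FILE" "$MAILBOX_FILE.processing"
        while IFS='|' read -r timestamp sender recipient type body; do
            case "$recipient" in
                supervisor|worker-*)
                    # -l sends the body literally, so a message such as "Up"
                    # or "C-c" is typed as text instead of pressed as a key.
                    tmux send-keys -t "cmux:$recipient" -l "$body"
                    tmux send-keys -t "cmux:$recipient" Enter
                    ;;
            esac
        done < "$MAILBOX_FILE.processing"
        rm -f "$MAILBOX_FILE.processing"  # Clear processed messages
    fi
    sleep 2
done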
This pattern keeps agents decoupled. Workers don’t need to know about each other. The supervisor is the only agent that sees the full picture.
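The router reads five pipe-separated fields per line: timestamp, sender, recipient, type, and body. A minimal sketch of the writing side, assuming that layout and nothing more, could look like the following. The function names and the ISO timestamp format are my assumptions, not the real `./tools/mailbox` implementation.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import NamedTuple, Optional


class Message(NamedTuple):
    timestamp: str
    sender: str
    recipient: str
    msg_type: str
    body: str


def format_message(sender: str, recipient: str, msg_type: str, body: str,
                   now: Optional[datetime] = None) -> str:
    """Render one mailbox line: timestamp|sender|recipient|type|body."""
    now = now or datetime.now(timezone.utc)
    for name, value in (("sender", sender), ("recipient", recipient), ("type", msg_type)):
        # A pipe or newline in a routing field would shift every later field.
        if "|" in value or "\n" in value:
            raise ValueError(f"{name} must not contain '|' or a newline")
    # The body is the last field, so pipes are safe there, but a newline
    # would split the message into two records. Flatten it to one line.
    flat_body = " ".join(body.splitlines())
    return f"{now.isoformat(timespec='seconds')}|{sender}|{recipient}|{msg_type}|{flat_body}"


def parse_message(line: str) -> Message:
    """Split on the first four pipes only, mirroring the router's `read`."""
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        raise ValueError(f"malformed mailbox line: {line!r}")
    return Message(*parts)


def append_message(mailbox: Path, line: str) -> None:
    # Append mode: each message is written with a single short write call.
    with mailbox.open("a", encoding="utf-8") as fh:
        fh.write(line + "\n")
```

Validating the routing fields at write time is cheaper than debugging a message that was delivered to the wrong tmux window.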
Here’s what every multi-agent tutorial skips: things break constantly. Agents hallucinate file paths. They write code with syntax errors. They get stuck in loops. They misunderstand their task.
You need automated recovery at multiple levels:
The first level is the worker itself. Give agents the ability to detect and fix their own mistakes: they should run tests, check compilation, and validate output before reporting success.
# In the worker's task instructions
WORKER_INSTRUCTIONS = """
TESTING IS MANDATORY. Before reporting completion:
1. Run the relevant test suite
2. Verify your changes compile/lint cleanly
3. Check that existing tests still pass
4. If anything fails, fix it before reporting done
"""
The second level is supervisor review. The supervisor reviews worker output and catches issues that workers miss. If a worker’s code doesn’t integrate cleanly with other changes, the supervisor can reassign the task or spawn a new worker to fix it.
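Part of that review can be mechanical. As a rough sketch, a supervisor-side check might confirm that the files a worker claims to have written exist and that the test suite passes before a "done" message is accepted. `verify_completion` is a hypothetical helper, and the default test command assumes a pytest project; pass `test_command` for anything else.

```python
import subprocess
import sys
from pathlib import Path


def verify_completion(project_path, expected_files, test_command=None, timeout=600):
    """Return a list of problems; an empty list means the claim checks out."""
    root = Path(project_path)
    # Assumption: the project uses pytest unless the caller says otherwise.
    command = list(test_command) if test_command else [sys.executable, "-m", "pytest", "-q"]

    # Agents sometimes report files they never wrote, so check those first.
    problems = [f"missing file: {rel}" for rel in expected_files
                if not (root / rel).is_file()]
    if problems:
        return problems  # no point running tests against missing files

    try:
        result = subprocess.run(command, cwd=root, capture_output=True,
                                text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return [f"tests timed out after {timeout}s"]
    except OSError as exc:
        return [f"could not run tests: {exc}"]

    if result.returncode != 0:
        # Keep only the tail of the output so the report stays small.
        tail = (result.stdout + result.stderr)[-500:]
        problems.append(f"tests failed with exit code {result.returncode}: {tail}")
    return problems
```

Returning a list of problems instead of a boolean gives the supervisor something concrete to send back to the worker, or to the replacement worker it spawns.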
The third level is the safety net. A health monitor watches the entire system and takes action when things go wrong:
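#!/bin/bash
# Health monitor - polls system health, rolls back on failure
FAIL_COUNT=0
MAX_FAILURES=3
while true; do
    # --max-time keeps a hung server from stalling the monitor;
    # curl reports 000 when it gets no HTTP response at all.
    HTTP_CODE=$(curl -s -o /dev/null --max-time 5 -w "%{http_code}" http://localhost:8000/api/webhooks/health)
    if [ "$HTTP_CODE" != "200" ]; then
        FAIL_COUNT=$((FAIL_COUNT + 1))
        if [ "$FAIL_COUNT" -ge "$MAX_FAILURES" ]; then
            echo "System unhealthy. Rolling back..."
            git stash                # keep tracked, uncommitted changes recoverable
            git reset --hard HEAD~1  # drop the most recent commit
            # Rebuild and restart. The subshell leaves the working directory
            # unchanged even if the frontend build fails.
            uv sync && (cd src/frontend && npm run build)
            ./src/orchestrator/cmux.sh restart
            FAIL_COUNT=0
        fi
    else
        FAIL_COUNT=0
    fi
    sleep 10
done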
This is critical for self-modifying systems. Agents modify the codebase they’re running on. If a change breaks the system, the monitor detects it and rolls back automatically. The agents can experiment aggressively because the safety net catches failures.
LLM agents lose all context when their session ends. For a multi-agent system that runs continuously, you need persistent memory that survives across sessions.
CMUX uses a layered memory system:
Journal entries — Structured logs of what each agent did, why, and what they learned. Stored as daily markdown files.
Auto-memory files — Key facts, patterns, and corrections that are loaded into every new agent session automatically.
Project registry — Metadata about what projects exist, their status, and which agents are assigned.
# Journal entry structure
{
    "title": "Implemented rate limiting",
    "content": """
## What was done
Added token bucket rate limiter to /api/auth endpoints.
## Why
Production logs showed brute force attempts.
## Key decisions
- Used token bucket over sliding window (simpler, good enough)
- 10 requests/minute per IP
## Issues encountered
- Redis dependency rejected; used in-memory with file backup instead
""",
    "tags": ["backend", "security", "rate-limiting"]
}
The journal is the system’s long-term memory. When a new supervisor session starts, it reads recent journal entries to understand what’s been happening. When a worker encounters a problem, it can search past entries for similar issues and solutions.
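As a rough illustration of the journal as daily markdown files, the sketch below renders an entry, appends it to a per-day file, and searches the text for a keyword. The on-disk layout (one `# title` heading per entry, a `tags:` line, files named `YYYY-MM-DD.md`) is an assumption for the example, not CMUX's actual format, and all function names are hypothetical.

```python
from datetime import date
from pathlib import Path


def render_entry(title: str, content: str, tags: list[str]) -> str:
    """Render one entry as markdown under a top-level heading."""
    tag_line = ", ".join(tags)
    return f"# {title}\n\ntags: {tag_line}\n\n{content.strip()}\n\n"


def journal_path(journal_dir: Path, day: date) -> Path:
    # Assumed naming scheme: one file per day, e.g. 2025-01-02.md
    return journal_dir / f"{day.isoformat()}.md"


def append_entry(journal_dir: Path, day: date, title: str,
                 content: str, tags: list[str]) -> Path:
    journal_dir.mkdir(parents=True, exist_ok=True)
    path = journal_path(journal_dir, day)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(render_entry(title, content, tags))
    return path


def matching_titles(journal_text: str, keyword: str) -> list[str]:
    """Titles of entries whose text mentions keyword (case-insensitive).

    Entries are split on top-level '# ' headings, so entry content should
    only use '##' or deeper headings.
    """
    needle = keyword.lower()
    titles = []
    for chunk in ("\n" + journal_text).split("\n# ")[1:]:
        lines = chunk.splitlines()
        if lines and needle in chunk.lower():
            titles.append(lines[0].strip())
    return titles
```

A plain substring search is crude, but for a few months of daily files it is usually enough for a worker to find out whether a problem has been seen before.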
This is the difference between a system that starts from zero every time and one that accumulates institutional knowledge.
After months of iteration, here are the patterns that survived:
Use real terminals, not sandboxed execution. Agents running in tmux windows can do everything a developer can. They can install packages, run build tools, use git, and debug interactively. Sandboxed execution environments are safer but too limiting for complex tasks.
Event-driven over polling. Use webhooks and file watchers instead of periodic status checks wherever possible. When an agent does the polling, every check costs tokens and adds latency.
Prefer simplicity. A file-based mailbox beats a message queue for systems under 20 agents. SQLite beats Postgres when you’re running on a single machine. Don’t add infrastructure until you need it.
Make agents disposable. Workers should be cheap to create and destroy. Don’t try to maintain long-running worker sessions — context degrades over time. Spawn fresh agents for each task.
Log everything. Every agent action, every message, every decision should be logged. When something goes wrong (and it will), you need to trace exactly what happened.
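For the logging pattern, one simple option is a JSON-lines event log: one record per agent action, easy to append and easy to filter later. This is a sketch under my own assumptions; the field names (`ts`, `agent`, `action`, `detail`) and both functions are hypothetical, not CMUX's log schema.

```python
import json
from datetime import datetime, timezone
from typing import Iterable, Optional


def log_record(agent_id: str, action: str, detail: dict,
               now: Optional[datetime] = None) -> str:
    """Serialize one agent action as a single JSON line."""
    now = now or datetime.now(timezone.utc)
    record = {
        "ts": now.isoformat(timespec="seconds"),
        "agent": agent_id,
        "action": action,
        "detail": detail,
    }
    # sort_keys gives stable output, which makes logs easier to diff.
    return json.dumps(record, sort_keys=True)


def trace(lines: Iterable[str], agent_id: str) -> list[dict]:
    """Return every record for one agent, in the order it was logged."""
    records = (json.loads(line) for line in lines if line.strip())
    return [r for r in records if r["agent"] == agent_id]
```

Because each line is a complete JSON document, the same file can be read with `trace` or with ordinary command-line tools when something goes wrong.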
The mistakes to avoid have been just as consistent.
Over-engineering coordination. Start with direct supervisor-to-worker delegation. Add complexity (consensus protocols, voting, negotiation) only when you have evidence you need it.
Trusting agent output without verification. Always validate. Run tests. Check that files exist. Verify that APIs return expected responses. Agents are confident even when wrong.
Ignoring context window limits. A multi-agent system generates a lot of text. Without active context management (compaction, summarization, selective loading), agents degrade as their context fills up.
No rollback strategy. If your agents modify shared state (files, databases, APIs), you need a way to undo changes when things go wrong. Git for code, database transactions for data, feature flags for deployments.
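Of these pitfalls, context window limits are the easiest to illustrate in a few lines. The sketch below keeps the newest messages that fit a token budget and folds everything older into a single placeholder line. The four-characters-per-token estimate is a crude assumption (a real system would use the model's tokenizer), and `compact` and `estimate_tokens` are hypothetical names.

```python
from typing import Callable, Optional


def estimate_tokens(text: str, chars_per_token: int = 4) -> int:
    # Crude assumption: roughly four characters per token.
    return max(1, len(text) // chars_per_token)


def compact(messages: list[str], budget: int,
            summarize: Optional[Callable[[list[str]], str]] = None) -> list[str]:
    """Keep the newest messages that fit the budget; summarize the rest.

    The summary line itself is not counted against the budget.
    """
    kept: list[str] = []
    used = 0
    # Walk backwards so the most recent context is the last to be dropped.
    for msg in reversed(messages):
        cost = estimate_tokens(msg)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    kept.reverse()

    dropped = messages[: len(messages) - len(kept)]
    if not dropped:
        return kept
    if summarize is None:
        # Placeholder only. A real system might ask a model for a summary.
        summarize = lambda msgs: f"[{len(msgs)} earlier messages omitted]"
    return [summarize(dropped)] + kept
```

Passing a `summarize` callable lets the same function cover both plain truncation and model-written summaries.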
If you’re building a multi-agent AI system, start simple.
The hard part isn’t the AI — it’s the infrastructure. Agent coordination, error recovery, and persistent memory are what separate a toy from a tool.
I build multi-agent AI systems and automation tools for businesses. If you need help designing or implementing an AI agent architecture, check out my services or get in touch.