Multi-Agent Systems: When One Brain Isn't Enough
In AI 05, we built a single agent: one LLM, one loop, one goal. It could query databases, calculate figures, and generate reports. But some problems are too complex for one agent — just as some projects are too large for one engineer.
A financial audit requires a researcher who pulls the data, an analyst who identifies anomalies, a writer who drafts the report, and a reviewer who validates the conclusions. These are different competencies — different tools, different system prompts, different specializations. Cramming all of them into a single agent creates a generalist that does everything adequately but nothing excellently.
Multi-Agent Systems split complex tasks across specialized agents that collaborate. The architecture mirrors how high-performance human teams work: clear roles, defined handoffs, and a coordinator who keeps the whole operation aligned.
TL;DR: Multi-agent systems solve the specialization and parallelization problems that single agents can’t. One orchestrator coordinates multiple specialist agents — each with their own tools, prompts, and expertise. This article covers the four collaboration patterns, how to build them with LangGraph, CrewAI, and AutoGen, the critical cost explosion problem, and the anti-patterns that turn agent teams into chaos.
⚠️ Freshness Warning: Multi-agent frameworks evolve rapidly. This article focuses on architecture patterns that remain stable. Verify framework-specific APIs against current documentation.
┌──────────────────────────────────────────────────────────┐
│ Supervisor Pattern (most common) │
│ │
│ ┌──────────────┐ │
│ │ Orchestrator │ ← receives goal, routes │
│ │ (Agent 0) │ tasks, collects results │
│ └──────┬───────┘ │
│ ┌─────────┼─────────┐ │
│ ┌─────▼───┐ ┌───▼────┐ ┌──▼──────┐ │
│ │Researcher│ │Analyst │ │Reviewer │ │
│ │(Agent 1) │ │(Agent 2)│ │(Agent 3)│ │
│ │tools: │ │tools: │ │tools: │ │
│ │web_search│ │calc │ │validate │ │
│ │db_query │ │chart │ │send_msg │ │
│ └─────────┘ └────────┘ └─────────┘ │
│ │
│ Each agent: own system prompt, own tools, own LLM │
│ Orchestrator: routes tasks, enforces HITL approvals │
└──────────────────────────────────────────────────────────┘
Article Map
I — Theory Layer (why multi-agent?)
- Why Multi-Agent? — The specialization argument
- Collaboration Patterns — Supervisor, P2P, Hierarchical, Blackboard
II — Architecture Layer (how it’s built)
- Communication Protocols — Messages, handoffs, shared memory
- Agent-to-Agent (A2A) Protocol — Google’s inter-agent standard
III — Engineering Layer (building and operating)
- Building Multi-Agent Systems — LangGraph, CrewAI, AutoGen
- The Cost Explosion Problem — Super-linear scaling, budgets
- Agent Evaluation: The Judge-LLM Pattern — Automated quality control
- Challenges & Anti-Patterns — What kills production systems
- Key Takeaways — Decision framework
1. Why Multi-Agent? The Specialization Argument
1.1 The Single-Agent Ceiling
A single agent has three fundamental limitations:
Single-Agent Limitations:
1. CONTEXT WINDOW CEILING
All tools, all history, all state must fit in one context window.
A financial audit agent needs:
- DB tools (query, insert, validate)
- Web tools (search, fetch)
- Communication tools (email, Slack)
- Document tools (PDF, Excel)
- 50+ steps of history
→ Total: 150K+ tokens → expensive and error-prone
2. COGNITIVE OVERLOAD
One LLM tries to be researcher, analyst, writer, AND reviewer.
"Jack of all trades, master of none."
→ Each role needs different system prompts
→ Switching roles within one prompt degrades quality
3. SEQUENTIAL EXECUTION
All steps must happen in one long chain.
→ Can't parallelize: research + analysis + review simultaneously
→ A 3-hour job stays 3 hours instead of 1 hour with 3 agents
1.2 The Multi-Agent Solution
Multi-Agent vs. Single Agent:
SINGLE AGENT:
User Goal → [One LLM handles everything: research + analyze + write + review]
→ Long chain, large context, sequential, fragile
MULTI-AGENT:
User Goal
│
▼
Orchestrator: "This needs research first."
│
├──→ Researcher Agent: web_search + db_query → raw data
│
├──→ Analyst Agent: calculate + chart → insights (parallel!)
│
├──→ Writer Agent: draft report → formatted output (uses insights)
│
└──→ Reviewer Agent: validate + ⏸ HITL → approved final
Each agent:
- Smaller context window → cheaper per call
- Focused system prompt → better quality
- Specialized tools → right tool for each job
- Can run in parallel → faster end-to-end
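The parallelization benefit can be sketched with plain asyncio. The agent coroutines below are hypothetical stand-ins for LLM-backed agents (the names, timings, and return values are illustrative only), but the shape is the point: independent steps overlap instead of queuing.

```python
import asyncio
import time

# Hypothetical stand-ins for LLM-backed agents: each sleep simulates
# a slow model call. Names and timings are illustrative only.
async def researcher(goal: str) -> str:
    await asyncio.sleep(1.0)   # pretend: web_search + db_query
    return f"raw data for {goal!r}"

async def market_scout(goal: str) -> str:
    await asyncio.sleep(1.0)   # pretend: industry benchmark lookup
    return f"benchmarks for {goal!r}"

async def run_audit(goal: str) -> dict:
    # Independent steps run concurrently instead of back-to-back
    raw, benchmarks = await asyncio.gather(researcher(goal), market_scout(goal))
    return {"raw_data": raw, "benchmarks": benchmarks}

start = time.perf_counter()
result = asyncio.run(run_audit("Q4 audit"))
elapsed = time.perf_counter() - start
print(f"done in {elapsed:.1f}s")   # ~1s wall clock for ~2s of agent work
```

Steps with data dependencies (the writer needs the analyst's output) still run in sequence; only genuinely independent work parallelizes.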
Connection to AI 00 §6.3: This mirrors the Transformer’s Multi-Head Attention mechanism. In a Transformer, multiple attention heads simultaneously attend to different aspects of the input — one head for syntax, another for semantics, another for long-range dependencies. Multi-agent systems apply the same principle to task execution: multiple specialized agents simultaneously attend to different aspects of a complex problem.
1.3 When Multi-Agent Is (and Isn’t) Worth It
| Condition | Decision |
|---|---|
| Task requires 10+ steps | ✅ Consider multi-agent |
| Task needs clearly distinct expertise (research vs. analysis vs. review) | ✅ Use multi-agent |
| Steps can run in parallel | ✅ Strong case for multi-agent |
| Quality review of AI output is critical | ✅ Use a dedicated reviewer agent |
| Task is simple and well-defined | ❌ Single agent is cheaper/simpler |
| Fewer than 3 distinct roles are needed | ❌ Single agent with more tools |
| Budget is tight | ❌ Multi-agent costs 3-5× more (see §6) |
2. Four Collaboration Patterns
Multi-agent systems don’t all work the same way. The collaboration pattern determines how agents communicate and coordinate:
2.1 Supervisor Pattern (Orchestrator + Workers)
The most common pattern. A central orchestrator routes tasks to specialist workers and assembles the results.
Supervisor Pattern:
┌────────────────────┐
│ ORCHESTRATOR │
│ │
│ Receives goal │
│ Decomposes task │
│ Routes to agents │
│ Collects results │
│ Returns final │
└─────────┬─────────┘
┌────────────────┼────────────────┐
│ │ │
┌─────▼──────┐ ┌──────▼──────┐ ┌─────▼──────┐
│ Worker A │ │ Worker B │ │ Worker C │
│ Researcher │ │ Analyst │ │ Reviewer │
│ │ │ │ │ │
│ web_search │ │ calculate │ │ validate │
│ db_query │ │ chart_gen │ │ send_email │
└─────────────┘ └─────────────┘ └────────────┘
Communication flow:
Orchestrator → Worker A: "Research Q4 revenue data"
Worker A → Orchestrator: [raw data]
Orchestrator → Worker B: "Analyze this data" + [raw data]
Worker B → Orchestrator: [insights + charts]
Orchestrator → Worker C: "Review and approve" + [draft + insights]
Worker C → Orchestrator: [approved report] OR [revision requests]
Orchestrator → User: [final report]
Best for: Sequential workflows with clear handoffs. Financial reporting, content creation pipelines, code review workflows.
Weakness: Single point of failure — if the orchestrator makes a wrong routing decision, the whole pipeline suffers.
2.2 Peer-to-Peer (Decentralized)
Agents communicate directly with each other, without a central coordinator. Each agent decides who to talk to next.
Peer-to-Peer Pattern:
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Agent A │────→│ Agent B │────→│ Agent C │
│Researcher│ │Analyst │ │Writer │
└──────────┘ └──────────┘ └──────────┘
↑ │
└──────────────────────────────────┘
(feedback loop)
Each agent decides:
"I've completed my part. Who needs my output?"
"I need X. Who can provide it?"
Best for: Creative workflows where the “next step” isn’t always predetermined. Brainstorming, exploration, R&D.
Weakness: Hard to audit, hard to debug, easy to create communication loops. Not recommended for production systems that require explainability.
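One common mitigation for communication loops is a hop budget carried inside every message: each handoff increments a counter, and the system refuses to forward once the budget is spent. A minimal sketch (field names are assumptions, not a standard):

```python
MAX_HOPS = 8  # illustrative budget; tune per workflow

def forward(message: dict, next_agent: str) -> dict:
    """Pass a message peer-to-peer, refusing once the hop budget is spent."""
    hops = message.get("hops", 0)
    if hops >= MAX_HOPS:
        raise RuntimeError(
            f"hop budget exhausted after {hops} handoffs -- likely a loop"
        )
    # Return a new message rather than mutating the old one,
    # so each hop stays auditable
    return {**message, "to": next_agent, "hops": hops + 1}

msg = {"to": "analyst", "content": "draft v1", "hops": 0}
for peer in ["writer", "reviewer", "writer", "reviewer"]:
    msg = forward(msg, peer)
print(msg["hops"])  # 4
```

A hop budget doesn't make P2P auditable, but it converts silent infinite loops into loud, debuggable failures.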
2.3 Hierarchical (Manager → Team → Worker)
Multi-level supervision. A top-level orchestrator manages mid-level managers, who manage specialist workers.
Hierarchical Pattern:
┌─────────────────────────────────────────────────┐
│ VP-LEVEL ORCHESTRATOR │
│ "Prepare the monthly board package" │
└───────────┬─────────────────────┬───────────────┘
│ │
┌─────────▼──────┐ ┌─────────▼──────┐
│ FINANCE TEAM │ │ STRATEGY TEAM │
│ Manager │ │ Manager │
└──┬──────┬──────┘ └──┬──────┬─────┘
│ │ │ │
┌───▼──┐ ┌─▼────┐ ┌───▼──┐ ┌─▼────┐
│Rev │ │COGS │ │Mkt │ │Risk │
│Agent │ │Agent │ │Agent │ │Agent │
└──────┘ └──────┘ └──────┘ └──────┘
→ Scales to large, complex domains
→ Each "team" has its own specialized context
→ Manager agents aggregate and synthesize sub-team results
Best for: Enterprise-scale workflows. Monthly close automation, due diligence, large codebase refactoring.
Weakness: High overhead (many agents = many tokens), complex orchestration, expensive to debug.
2.4 Blackboard Pattern (Shared Workspace)
All agents read and write to a shared “blackboard” (a structured state object). No direct agent-to-agent messages — coordination happens through the shared state.
Blackboard Pattern:
┌─────────────────────────────────────────────┐
│ BLACKBOARD (Shared State) │
│ │
│ goal: "Q4 financial audit" │
│ raw_data: [query results...] │
│ analysis: {insights: [...]} │
│ anomalies: ["Q3 COGS spike +23%"] │
│ draft: "Q4 saw strong revenue..." │
│ review_status: "approved" │
│ final_report: "..." │
└──────┬──────────────┬──────────────┬────────┘
│ │ │
┌────▼────┐ ┌─────▼───┐ ┌─────▼───┐
│Researcher│ │Analyst │ │Writer │
│reads: │ │reads: │ │reads: │
│ goal │ │raw_data │ │analysis │
│writes: │ │writes: │ │writes: │
│ raw_data │ │analysis │ │ draft │
└──────────┘ └─────────┘ └─────────┘
LangGraph implements this naturally — AgentState IS the blackboard.
Each node reads from and writes to the shared state dict.
Best for: LangGraph-based systems, workflows where agents naturally sequence by reading previous results.
Weakness: Requires careful state schema design upfront; complex dependency management.
🔧 Engineer’s Note: In practice, most production multi-agent systems combine patterns. A top-level supervisor routes tasks (Supervisor pattern), sub-tasks share state through a Blackboard (Blackboard pattern), and a dedicated Reviewer communicates directly with the Writer (limited P2P for feedback). Don’t feel constrained to pick exactly one pattern.
3. Communication Protocols: How Agents Talk
Coordination between agents requires a well-defined communication protocol. There are three mechanisms:
3.1 Message Passing
The simplest form: agents send structured messages to each other.
Message Structure:
{
"from": "researcher_agent",
"to": "analyst_agent",
"type": "task_result",
"content": {
"data": [...query results...],
"metadata": {
"source": "PostgreSQL financials DB",
"rows": 847,
"query_time_ms": 234
}
},
"timestamp": "2024-12-31T09:15:00Z",
"task_id": "monthly-audit-2024-12"
}
Key design decisions:
- Schema: Define strict schemas for messages — unstructured strings lead to misinterpretation
- Routing: Who decides which agent receives the message? (Orchestrator vs. self-routing)
- Failure handling: What happens when the recipient agent fails?
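The schema point can be enforced directly in code. Here is a minimal illustrative sketch using a dataclass that fails fast on malformed messages (the class and field names are assumptions for this example, not a standard):

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from typing import Any

VALID_TYPES = {"task_request", "task_result", "error"}

@dataclass
class AgentMessage:
    sender: str
    recipient: str
    type: str
    content: dict[str, Any]
    task_id: str
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def __post_init__(self) -> None:
        # Fail fast on malformed messages instead of letting a
        # downstream agent misinterpret an unstructured string
        if self.type not in VALID_TYPES:
            raise ValueError(f"unknown message type: {self.type!r}")
        if not self.recipient:
            raise ValueError("message needs an explicit recipient")

msg = AgentMessage(
    sender="researcher_agent",
    recipient="analyst_agent",
    type="task_result",
    content={"rows": 847, "source": "PostgreSQL financials DB"},
    task_id="monthly-audit-2024-12",
)
print(asdict(msg)["type"])  # task_result
```

In production you would likely reach for Pydantic for richer validation, but even a dataclass with a `__post_init__` check catches the most common failure: a typo'd message type silently routed to the wrong handler.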
3.2 Shared State (Blackboard)
All agents read and write to a shared state object. No direct messages — coordination is implicit.
# In LangGraph: AgentState IS the shared blackboard
from typing import TypedDict, Optional

class TeamState(TypedDict):
# Input
goal: str
# Intermediate results (written by specific agents, read by others)
raw_data: Optional[list] # written by: researcher
analysis: Optional[dict] # written by: analyst
draft_report: Optional[str] # written by: writer
review_notes: Optional[list[str]] # written by: reviewer
# Control flow
current_agent: str # which agent should act next
revision_count: int # how many review cycles
# Output
final_report: Optional[str]
approved: bool
3.3 Agent Handoffs
A more structured form of message passing: one agent explicitly hands control to another, along with all relevant context.
Handoff Pattern:
Researcher → Analyst (handoff):
"Here is the data I collected. Your task: identify anomalies.
Context you need:
- Q1-Q4 revenue: [data]
- YoY comparison baseline: [data]
- Industry benchmarks: [reference]
Focus particularly on: Q3 COGS variance (23% spike).
Return: your analysis in structured JSON."
Why handoffs are better than raw message passing:
┌─────────────────┬─────────────────────────────────────┐
│ Raw Message │ Handoff │
├─────────────────┼─────────────────────────────────────┤
│ "Here's data." │ "Here's data + context + your goal │
│ │ + what to focus on + expected │
│ │ output format." │
│ Recipient must │ Recipient has everything it needs. │
│ infer context. │ Less guesswork, fewer errors. │
└─────────────────┴─────────────────────────────────────┘
🔧 Engineer’s Note: The quality of agent handoffs directly determines the quality of the multi-agent system. A sloppy handoff (“here’s the data, do something with it”) forces the downstream agent to re-derive context it shouldn’t need to derive. A precise handoff (“here’s the data, here’s what I found interesting, here’s your specific task, here’s the output format I expect”) feeds the downstream agent exactly what it needs. Invest time designing handoff message templates.
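One way to operationalize that advice is a handoff template builder that refuses to emit a message with missing fields. A sketch, with hypothetical field names:

```python
def build_handoff(data_summary: str, findings: str, task: str,
                  focus: str, output_format: str) -> str:
    """Render a precise handoff message from required fields.

    Requiring every field to be non-empty blocks the sloppy
    "here's the data, do something with it" handoff.
    """
    fields = {"data_summary": data_summary, "findings": findings,
              "task": task, "focus": focus, "output_format": output_format}
    missing = [name for name, value in fields.items() if not value.strip()]
    if missing:
        raise ValueError(f"incomplete handoff, missing: {missing}")
    return (
        f"Here is the data I collected: {data_summary}\n"
        f"What I found interesting: {findings}\n"
        f"Your task: {task}\n"
        f"Focus particularly on: {focus}\n"
        f"Return: {output_format}"
    )

handoff = build_handoff(
    data_summary="Q1-Q4 revenue + YoY baseline",
    findings="Q3 COGS variance (+23% spike)",
    task="identify anomalies",
    focus="Q3 COGS",
    output_format="structured JSON",
)
```

The template is deliberately boring: the value is in the `ValueError`, which turns "the analyst got a vague handoff" from a silent quality problem into a visible bug.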
4. The A2A Protocol: Google’s Inter-Agent Standard
So far, we’ve discussed agent communication within a single system. But what about communication between systems — your company’s agent talking to a vendor’s agent, or a finance agent delegating to a specialized compliance agent from a different provider?
This is the problem that A2A (Agent-to-Agent) solves. Google announced it in April 2025, with backing from Salesforce, SAP, MongoDB, Atlassian, and 50+ partners.
4.1 The Problem A2A Solves
Without A2A:
YourCompany.FinanceAgent ──proprietary format──→ Vendor.TaxAgent
↑
└── Custom integration code required for every agent pair
N finance agents × M tax agents = N×M integrations
With A2A:
YourCompany.FinanceAgent ──A2A protocol──→ Vendor.TaxAgent
A2A is universal
N + M implementations, not N × M
The same M×N → M+N problem that MCP solved for tools,
A2A solves for agent-to-agent communication.
4.2 A2A Core Concepts
A2A Architecture:
AGENT CARD (Discovery)
─────────────────────
Each A2A-capable agent publishes an "Agent Card" at a well-known URL:
https://vendor.com/.well-known/agent.json
{
"name": "TaxCompliance Agent",
"description": "Handles tax calculations, filings, and compliance checks",
"url": "https://vendor.com/a2a/tax-agent",
"capabilities": ["tax_calculation", "form_1099", "vat_europe"],
"authentication": {"type": "oauth2"},
"skills": [
{"name": "calculate_tax", "description": "..."},
{"name": "validate_filing", "description": "..."}
]
}
TASK-BASED INTERACTION
──────────────────────
A2A uses tasks (not function calls) as the fundamental unit:
{
"task_id": "tax-calc-001",
"message": "Calculate Q4 2024 federal tax for revenue $2.042M",
"artifacts": [{ "type": "data", "content": [financial_data] }]
}
→ Agent processes task asynchronously
→ Returns result when complete (not a synchronous call)
STREAMING SUPPORT
─────────────────
Long-running tasks can stream intermediate updates:
Agent → "Task received, starting calculation" (10% progress)
→ "Federal calculation complete" (60% progress)
→ "State calculations complete" (90% progress)
→ "Final tax liability: $542,310" (100% done)
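Capability discovery against an Agent Card boils down to JSON parsing plus a set check. The sketch below mirrors the card fields from the example above; treat it as illustrative rather than a spec-complete A2A client (a real one would also handle authentication and task submission):

```python
import json

# A card as it might be served from /.well-known/agent.json
# (fields mirror the example above; content is illustrative)
AGENT_CARD_JSON = """
{
  "name": "TaxCompliance Agent",
  "url": "https://vendor.com/a2a/tax-agent",
  "capabilities": ["tax_calculation", "form_1099", "vat_europe"],
  "skills": [{"name": "calculate_tax", "description": "..."}]
}
"""

def supports(card: dict, needed: set[str]) -> bool:
    """Check whether a discovered agent advertises every needed capability."""
    return needed.issubset(set(card.get("capabilities", [])))

# In a real client, the card would be fetched over HTTPS from the
# well-known URL before any task is delegated to the remote agent.
card = json.loads(AGENT_CARD_JSON)
print(supports(card, {"tax_calculation"}))   # True
print(supports(card, {"payroll"}))           # False
```

The design choice worth noting: discovery is declarative. Your agent decides whether to delegate based on what the remote agent *claims*, so production systems pair this check with authentication and result validation.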
4.3 MCP vs. A2A: Complementary Layers
How MCP (AI 04) and A2A Work Together:
┌─── YOUR AGENT ─────────────────────────────────────────┐
│ │
│ Goal: "Complete Q4 tax filing" │
│ │ │
│ ├── READ local data via MCP │
│ │ └→ MCP Server: PostgreSQL (financials) │
│ │ (AI 04 pattern) │
│ │ │
│ ├── DELEGATE calculation to specialist agent │
│ │ └→ A2A: TaxCompliance Agent (external) │
│ │ (AI 06 new pattern) │
│ │ │
│ └── SEND result via MCP tool │
│ └→ MCP Server: Email (SMTP) │
│ (AI 04 pattern) │
│ │
└────────────────────────────────────────────────────────┘
MCP: Agent ↔ Tools/Data (vertical — environment access)
A2A: Agent ↔ Agent (horizontal — peer collaboration)
They address different integration layers. Both are needed.
Connection to AI 04 §11.5: The Protocol Standards War between Anthropic (MCP) and Google (A2A) discussed in AI 04 is the strategic backdrop. In practice, production systems will likely use both: MCP for tool/data access during agent execution, and A2A for cross-system agent collaboration. Neither protocol eliminates the need for the other.
5. Building Multi-Agent Systems
Three major frameworks dominate multi-agent development. Each reflects a different philosophy:
5.1 Framework Comparison
| Framework | Philosophy | Best For | Control |
|---|---|---|---|
| LangGraph | State machine — explicit graph of nodes and edges | Production, complex workflows, audit trails | High |
| CrewAI | Role-based — define “agents” like employees, “tasks” like job descriptions | Rapid prototyping, role-heavy workflows | Medium |
| AutoGen | Conversation-driven — agents converse until task is complete | Research, exploratory multi-agent setups | Low |
5.2 LangGraph: Multi-Agent with Full Control
LangGraph extends naturally from single-agent (AI 05) to multi-agent by making the graph itself a supervisor that routes between agent sub-graphs.
# Multi-agent financial audit system in LangGraph
# Tool functions (query_database, calculate, validate_numbers, ...)
# are assumed to be defined as in AI 05.
from langgraph.graph import StateGraph, END
from langgraph.checkpoint.memory import MemorySaver
from langchain_anthropic import ChatAnthropic
from langchain_core.messages import HumanMessage, SystemMessage, AIMessage
from typing import TypedDict, Annotated, Optional, Literal
from langgraph.graph.message import add_messages
# ─── Shared Team State (the Blackboard) ──────────────────
class AuditTeamState(TypedDict):
messages: Annotated[list, add_messages]
goal: str
raw_data: Optional[str] # from Researcher
analysis: Optional[str] # from Analyst
draft_report: Optional[str] # from Writer
review_notes: Optional[list[str]] # from Reviewer
final_report: Optional[str]
approved: bool
revision_count: int
next_agent: str # routing control
# ─── Agent Factory ────────────────────────────────────────
def make_agent(role: str, system_prompt: str, tools: list):
"""Create a specialist agent with a specific role."""
llm = ChatAnthropic(model="claude-sonnet-4-20250514")
if tools:
llm = llm.bind_tools(tools)
def agent_fn(state: AuditTeamState) -> AuditTeamState:
context = state.get("raw_data", "") or ""
analysis = state.get("analysis", "") or ""
draft = state.get("draft_report", "") or ""
# Build role-specific prompt with available context
user_msg = f"""Goal: {state['goal']}
Available context:
{f'Raw data: {context[:2000]}' if context else ''}
{f'Analysis: {analysis[:2000]}' if analysis else ''}
{f'Draft: {draft[:2000]}' if draft else ''}
Your task as {role}: Complete your part of this audit."""
messages = [
SystemMessage(content=system_prompt),
HumanMessage(content=user_msg)
] + state["messages"][-3:] # Last 3 messages for context
response = llm.invoke(messages)
return {"messages": [response]}
return agent_fn
# ─── Define Specialist Agents ─────────────────────────────
researcher_agent = make_agent(
role="Researcher",
system_prompt="""You are a data researcher. Your job:
1. Query the financial database for relevant data
2. Always LIMIT queries to 50 rows (data truncation best practice)
3. Return structured data with source metadata
4. When done, write your findings to state""",
tools=[query_database, fetch_schema]
)
analyst_agent = make_agent(
role="Financial Analyst",
system_prompt="""You are a financial analyst. Your job:
1. Analyze the raw data provided
2. Calculate growth rates, margins, variances
3. Flag anomalies (variances > 10%)
4. Return structured insights with confidence levels""",
tools=[calculate, generate_chart]
)
writer_agent = make_agent(
role="Report Writer",
system_prompt="""You are a financial report writer. Your job:
1. Transform analysis into a clear, professional report
2. Structure: Executive Summary → Key Findings → Anomalies → Recommendations
3. Use precise numbers and avoid vague language
4. Format in Markdown""",
tools=[generate_report]
)
reviewer_agent = make_agent(
role="Quality Reviewer",
system_prompt="""You are a quality reviewer. Your job:
1. Verify all numbers are accurate and consistent
2. Check that anomalies are properly flagged
3. Ensure recommendations are actionable
4. Return: 'APPROVED' OR specific revision requests""",
tools=[validate_numbers]
)
# ─── Orchestrator Node ────────────────────────────────────
def orchestrator(state: AuditTeamState) -> AuditTeamState:
"""Routes tasks to the appropriate specialist agent."""
# State-based routing logic
if not state.get("raw_data"):
return {"next_agent": "researcher"}
if not state.get("analysis"):
return {"next_agent": "analyst"}
if not state.get("draft_report"):
return {"next_agent": "writer"}
if not state.get("approved", False):
if state.get("revision_count", 0) >= 3:
# Max revisions reached — escalate to human
return {"next_agent": "human_review"}
return {"next_agent": "reviewer"}
return {"next_agent": "end"}
# ─── Researcher Node (with state update) ──────────────────
def run_researcher(state: AuditTeamState) -> AuditTeamState:
result = researcher_agent(state)
# Extract data from agent's response
data = extract_data_from_response(result["messages"][-1])
return {**result, "raw_data": data}
# ─── Reviewer Node (handles approval logic) ───────────────
def run_reviewer(state: AuditTeamState) -> AuditTeamState:
result = reviewer_agent(state)
response_text = result["messages"][-1].content
if "APPROVED" in response_text.upper():
return {**result, "approved": True, "final_report": state["draft_report"]}
else:
# Extract revision notes and send back to writer
notes = parse_revision_notes(response_text)
return {
**result,
"review_notes": notes,
"draft_report": None, # Clear draft for rewrite
"revision_count": state.get("revision_count", 0) + 1,
}
# ─── Routing function ──────────────────────────────────────
def route_after_orchestrator(state: AuditTeamState) -> str:
return state.get("next_agent", "end")
# ─── Build the Graph ──────────────────────────────────────
audit_workflow = StateGraph(AuditTeamState)
# Add nodes
audit_workflow.add_node("orchestrator", orchestrator)
audit_workflow.add_node("researcher", run_researcher)
def run_analyst(state: AuditTeamState) -> AuditTeamState:
    # Invoke the agent once, then store its extracted output in state.
    # (A lambda calling analyst_agent(s) twice would double the LLM cost.)
    result = analyst_agent(state)
    return {**result, "analysis": extract_analysis(result["messages"][-1])}
def run_writer(state: AuditTeamState) -> AuditTeamState:
    result = writer_agent(state)
    return {**result, "draft_report": extract_draft(result["messages"][-1])}
audit_workflow.add_node("analyst", run_analyst)
audit_workflow.add_node("writer", run_writer)
audit_workflow.add_node("reviewer", run_reviewer)
# Set entry point
audit_workflow.set_entry_point("orchestrator")
# Orchestrator routes to any agent
audit_workflow.add_conditional_edges(
"orchestrator",
route_after_orchestrator,
{
"researcher": "researcher",
"analyst": "analyst",
"writer": "writer",
"reviewer": "reviewer",
"end": END,
}
)
# All agents return to orchestrator
for agent in ["researcher", "analyst", "writer", "reviewer"]:
audit_workflow.add_edge(agent, "orchestrator")
# Compile with checkpointing
memory = MemorySaver()
audit_app = audit_workflow.compile(checkpointer=memory)
5.3 CrewAI: Role-Based Multi-Agent
CrewAI uses a higher-level abstraction: you define Agents (roles) and Tasks (jobs), and the framework handles coordination.
from crewai import Agent, Task, Crew, Process
from langchain_anthropic import ChatAnthropic
llm = ChatAnthropic(model="claude-sonnet-4-20250514")
# ─── Define Agents (Roles) ────────────────────────────────
researcher = Agent(
role="Financial Data Researcher",
goal="Gather complete and accurate financial data for the specified period",
backstory="Expert CPA with 15 years experience in financial data extraction. "
"Known for thoroughness and data quality.",
llm=llm,
tools=[query_database, fetch_schema],
verbose=True,
max_iter=5, # guardrail: max iterations per task
max_rpm=10, # guardrail: max requests per minute
)
analyst = Agent(
role="Financial Analyst",
goal="Identify meaningful patterns, anomalies, and insights from financial data",
backstory="Former Goldman Sachs analyst specializing in forensic accounting. "
"Has sharp eyes for irregularities.",
llm=llm,
tools=[calculate, generate_chart],
verbose=True,
)
writer = Agent(
role="Financial Report Writer",
goal="Transform complex financial analysis into clear, actionable board-level reports",
backstory="Harvard MBA with communications background. "
"Specializes in making numbers tell compelling stories.",
llm=llm,
tools=[generate_report],
verbose=True,
)
# ─── Define Tasks ─────────────────────────────────────────
research_task = Task(
description="""Query the financial database and gather:
1. Q1-Q4 revenue broken down by segment
2. COGS and gross margin for all quarters
3. Operating expenses by department
4. Year-over-year comparison data
Important: Always use LIMIT 50 in queries. Flag any data gaps.""",
expected_output="Structured JSON with all financial data, sources, and data quality notes",
agent=researcher,
)
analysis_task = Task(
description="""Analyze the research data and:
1. Calculate QoQ and YoY growth rates
2. Compute margin trends
3. Flag ALL variances > 10% as anomalies
4. Provide confidence score (1-5) for each finding
Use provided data only. No assumptions.""",
expected_output="Analysis report with findings, anomalies list, and confidence scores",
agent=analyst,
context=[research_task], # Depends on research_task output
)
write_task = Task(
description="""Write the Q4 Financial Audit Report:
- Executive Summary (3 bullets max)
- Key Financial Metrics (table format)
- Anomalies & Risk Flags (with severity: High/Medium/Low)
- Recommendations (actionable, specific)
Tone: professional, concise, board-ready.""",
expected_output="Complete financial report in Markdown, ready for board review",
agent=writer,
context=[research_task, analysis_task],
)
# ─── Assemble the Crew ────────────────────────────────────
audit_crew = Crew(
agents=[researcher, analyst, writer],
tasks=[research_task, analysis_task, write_task],
process=Process.sequential, # or Process.hierarchical
verbose=True,
memory=True, # Enable shared crew memory
max_rpm=30, # Global rate limit
)
# ─── Run ──────────────────────────────────────────────────
result = audit_crew.kickoff(inputs={
"quarter": "Q4",
"year": "2024",
"company": "Acme Corp"
})
print(result)
5.4 AutoGen: Conversation-Driven Agents
AutoGen models multi-agent collaboration as a conversation between agents. Agents talk to each other until the task is done.
import autogen
# Configuration
config_list = [{"model": "claude-sonnet-4-20250514", "api_key": "..."}]
# ─── Define Agents ────────────────────────────────────────
orchestrator = autogen.AssistantAgent(
name="Orchestrator",
system_message="""You coordinate the financial audit team.
Route tasks to the appropriate specialist.
End the conversation with TERMINATE when audit is approved.""",
llm_config={"config_list": config_list},
)
researcher = autogen.AssistantAgent(
name="Researcher",
system_message="You gather financial data. Always verify data quality.",
llm_config={"config_list": config_list},
)
analyst = autogen.AssistantAgent(
name="Analyst",
system_message="You analyze financial data and identify anomalies.",
llm_config={"config_list": config_list},
)
# Human proxy — the user's representative in the conversation
user_proxy = autogen.UserProxyAgent(
name="HumanReviewer",
human_input_mode="TERMINATE", # Only ask human when TERMINATE received
max_consecutive_auto_reply=10, # Guardrail: max auto replies
code_execution_config=False,
)
# ─── Create Group Chat ────────────────────────────────────
groupchat = autogen.GroupChat(
agents=[user_proxy, orchestrator, researcher, analyst],
messages=[],
max_round=20, # Guardrail: max conversation rounds
speaker_selection_method="round_robin", # or "auto" for LLM-based routing
)
manager = autogen.GroupChatManager(
groupchat=groupchat,
llm_config={"config_list": config_list},
)
# ─── Initiate conversation ────────────────────────────────
user_proxy.initiate_chat(
manager,
message="Please conduct a complete Q4 2024 financial audit for Acme Corp."
)
🔧 Engineer’s Note: Choose your framework based on your need for control vs. speed:
- LangGraph: When you need full auditability, custom routing logic, and production-grade reliability. More code, more control.
- CrewAI: When you want agents that “feel like team members” and need to prototype quickly. Good for role-heavy workflows.
- AutoGen: When the coordination logic should emerge from agent conversation rather than be explicitly coded. Best for research and exploration, less suited for deterministic production workflows.
5.5 The HITL Interrupt: How Escalation Actually Works in Code
The orchestrator function in §5.2 routes to "human_review" when revision_count >= 3. But what does that node look like? In LangGraph, human-in-the-loop works via checkpoints + interrupts: the graph pauses execution, persists state, and resumes only after a human provides input.
# Adding HITL interrupt to the audit workflow
from langgraph.types import interrupt, Command
# ─── The Human Review Node ────────────────────────────────
def human_review_node(state: AuditTeamState) -> Command:
"""
Pause execution and wait for human input.
LangGraph serializes state to the checkpointer (DB).
The graph resumes when the human calls app.invoke() again
with their decision injected into the state.
"""
# interrupt() pauses the graph here and surfaces the
# payload to the caller. Execution stops until resumed.
decision = interrupt({
"type": "human_review_required",
"reason": f"Max revisions ({state['revision_count']}) reached.",
"draft_report": state.get("draft_report", ""),
"review_notes": state.get("review_notes", []),
"instructions": "Reply with: APPROVE or REJECT: <reason>",
})
# After human resumes the graph, `decision` contains their input
if decision.upper().startswith("APPROVE"):
return Command(
goto="orchestrator",
update={"approved": True, "final_report": state["draft_report"]},
)
else:
# Human rejected — extract reason and reset for rewrite
reason = decision.replace("REJECT:", "").strip()
return Command(
goto="orchestrator",
update={
"review_notes": state.get("review_notes", []) + [f"Human: {reason}"],
"draft_report": None, # force full rewrite
"revision_count": 0, # reset counter after human feedback
},
)
# ─── Register the node ────────────────────────────────────
audit_workflow.add_node("human_review", human_review_node)
# human_review is already in the routing map from §5.2:
# route_after_orchestrator → "human_review" → "human_review" node
# ─── How the caller resumes after interrupt ───────────────
config = {"configurable": {"thread_id": "audit-2024-12"}}
# Initial run — will pause at human_review if triggered
for event in audit_app.stream(initial_state, config):
if "__interrupt__" in event:
# Surface the interrupt payload to your UI
payload = event["__interrupt__"][0].value
print(f"⏸ Human review needed: {payload['reason']}")
print(f"Draft:\n{payload['draft_report'][:500]}...")
break
# ... (human reads the draft, types their decision) ...
human_decision = "APPROVE" # or "REJECT: Missing Q2 breakdown"
# Resume the graph with human input injected
for event in audit_app.stream(
Command(resume=human_decision), # inject decision into interrupted node
config,
):
print(event)
Execution Timeline:
┌──────────────┐ ┌──────────────┐ ┌───────────────────┐
│ orchestrator │──→│ human_review │──→│ interrupt() │
└──────────────┘ └──────────────┘ │ State saved ✅ │
│ Graph paused ⏸ │
│ Waiting... │
└────────┬──────────┘
│
Human types: "APPROVE" │
▼
┌───────────────────┐
│ Graph resumes ▶ │
│ approved = True │
│ → orchestrator │
│ → END │
└───────────────────┘
Key: State is persisted by MemorySaver (or PostgresSaver in
production) between pause and resume. The graph can be paused
for hours/days without losing context.
🔧 Engineer’s Note:
`interrupt()` is fundamentally different from just routing to a `human_review` node that returns a hardcoded value. `interrupt()` actually suspends graph execution and persists state to disk/DB. This means your web server can handle 1,000 other requests while 50 audits are awaiting human review — each paused at its own `interrupt()` checkpoint, waiting to be resumed when the human responds. For production, swap `MemorySaver` for `PostgresSaver` so state survives server restarts.
6. The Cost Explosion Problem
This is the most under-discussed challenge in multi-agent systems. Costs don’t scale linearly — they scale super-linearly.
6.1 Why Multi-Agent Costs Multiply
Token Cost Analysis:
SINGLE AGENT (10-step task):
─────────────────────────────────────────────────────
System prompt: ~1,000 tokens
Per step (avg): ~800 tokens (history grows each step)
10 steps total: ~9,000 tokens input + ~2,000 tokens output
→ Total: ~11,000 tokens
→ Cost at $3/MTok: ~$0.033
3-AGENT TEAM (same task, split across agents):
─────────────────────────────────────────────────────
Each agent has its own system prompt: 3 × 1,000 = 3,000 tokens
Each agent processes its task: 3 × 3,000 = 9,000 tokens
Inter-agent messages (handoffs): ~5,000 tokens
Orchestrator overhead: ~8,000 tokens
→ Total: ~25,000 tokens
→ Cost at $3/MTok: ~$0.075
PLUS:
Reviewer agent (review loop ×2): +10,000 tokens
→ Total: ~35,000 tokens
→ Cost at $3/MTok: ~$0.105
RESULT: 3 agents → ~3.2× more tokens for the same task — and the
multiplier keeps growing with agent count
The super-linear cost comes from: orchestration overhead +
inter-agent message copying + review cycles.
Cost Growth by Agent Count:
Agents: 1 2 3 5 10
Tokens: 11K 20K 35K 85K 300K
Cost: $0.03 $0.06 $0.11 $0.26 $0.90
Multiplier vs. 1 agent:
1× 1.8× 3.2× 7.7× 27×
→ Cost grows faster than linearly with agent count
→ 10 agents costs 27× more than 1 agent for comparable work
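The figures above can be reproduced with a back-of-the-envelope cost model. This is a sketch using the article's illustrative numbers ($3/MTok, the per-agent token counts from §6.1); the parameter names are mine, and the model is roughly linear in agent count, so it understates costs for larger teams, where compounding conversation histories push the multiplier higher:

```python
# Back-of-the-envelope token/cost model using the illustrative figures above.
# Roughly linear in agent count; real systems grow faster because each
# agent's conversation history compounds across handoffs.

PRICE_PER_MTOK = 3.00  # $/million input tokens, as in the examples above

def estimate_tokens(n_agents: int,
                    system_prompt: int = 1_000,
                    work_per_agent: int = 3_000,
                    handoff: int = 2_500,
                    orchestrator_overhead: int = 8_000,
                    review_cycle: int = 5_000,
                    review_cycles: int = 2) -> int:
    """Per-agent work + inter-agent handoffs + orchestration + review loops."""
    if n_agents == 1:
        return 11_000  # single-agent baseline from the worked example
    agent_work = n_agents * (system_prompt + work_per_agent)
    handoffs = (n_agents - 1) * handoff
    reviews = review_cycles * review_cycle
    return agent_work + handoffs + orchestrator_overhead + reviews

def estimate_cost_usd(n_agents: int) -> float:
    return estimate_tokens(n_agents) * PRICE_PER_MTOK / 1_000_000

for n in (1, 3):
    print(f"{n} agent(s): ~{estimate_tokens(n):,} tokens ≈ ${estimate_cost_usd(n):.3f}")
# → 1 agent(s): ~11,000 tokens ≈ $0.033
# → 3 agent(s): ~35,000 tokens ≈ $0.105
```

Run this with your own per-agent token counts before committing to an architecture — if the estimate is already uncomfortable, the real bill will be worse.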
6.2 Budget Control Strategies
# Multi-agent cost control patterns
from dataclasses import dataclass
from typing import Dict

class BudgetExceededError(Exception):
    """Raised when a call would push the team over its budget."""

@dataclass
class AgentBudget:
    max_tokens_input: int = 50_000   # tokens per agent run
    max_tokens_output: int = 5_000
    max_iterations: int = 10
    max_cost_usd: float = 2.00

class MultiAgentBudgetManager:
    def __init__(self, team_budget_usd: float = 10.00):
        self.team_budget = team_budget_usd
        self.spent: Dict[str, float] = {}
        self.total_spent: float = 0.0

    def check_budget(self, agent_name: str, estimated_cost: float) -> bool:
        """Returns True if budget allows this call; raises BudgetExceededError otherwise."""
        if self.total_spent + estimated_cost > self.team_budget:
            raise BudgetExceededError(
                f"Team budget ${self.team_budget} would be exceeded by {agent_name}. "
                f"Spent so far: ${self.total_spent:.3f}"
            )
        return True

    def record_cost(self, agent_name: str, actual_cost: float):
        self.spent[agent_name] = self.spent.get(agent_name, 0) + actual_cost
        self.total_spent += actual_cost

    def report(self):
        print("\n💰 Cost Report:")
        for agent, cost in self.spent.items():
            pct = (cost / self.total_spent * 100) if self.total_spent > 0 else 0
            print(f"  {agent}: ${cost:.3f} ({pct:.1f}%)")
        print(f"  TOTAL: ${self.total_spent:.3f} / ${self.team_budget:.2f}")
6.3 Cost Optimization Strategies
| Strategy | How | Savings |
|---|---|---|
| Route cheap tasks to smaller models | Researcher uses Claude Haiku, Analyst uses Claude Sonnet | 60-80% on routine tasks |
| Summarize inter-agent context | Don’t pass full data — pass summaries | 40-60% on handoff tokens |
| Limit review cycles | Max 2 revisions, then escalate to human | Prevents runaway review loops |
| Cache intermediate results | Store researcher output, reuse for re-runs | 100% savings on re-runs |
| Parallelize where possible | Run research + web search simultaneously | No token savings, but 50% time savings |
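The first strategy in the table — routing cheap sub-tasks to smaller models — can be as simple as a lookup table with a safe fallback. A minimal sketch; the model names, task types, and prices here are illustrative assumptions, not real pricing:

```python
# Sketch: route each sub-task to the cheapest model tier that can handle it.
# Model names and prices are illustrative, not real pricing.

TASK_TIERS = {
    "web_research":  "small",   # extraction/summarization → cheap model
    "summarize":     "small",
    "data_analysis": "large",   # numeric reasoning → stronger model
    "final_report":  "large",   # board-facing output → stronger model
}

MODELS = {
    "small": {"name": "cheap-model",  "usd_per_mtok": 0.25},
    "large": {"name": "strong-model", "usd_per_mtok": 3.00},
}

def pick_model(task_type: str) -> dict:
    """Unknown task types fall back to the strong model (safe default)."""
    tier = TASK_TIERS.get(task_type, "large")
    return MODELS[tier]

print(pick_model("summarize")["name"])      # → cheap-model
print(pick_model("data_analysis")["name"])  # → strong-model
```

The key design choice is the fallback direction: when in doubt, route to the stronger model and pay more, rather than risk a cheap model quietly producing wrong numbers.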
🔧 Engineer’s Note: Build multi-agent systems with an explicit cost model before writing a single line of code. Estimate: N agents × average turns × average tokens per turn = expected cost, then add a 2× buffer for review cycles and orchestration overhead. If the expected cost is too high, redesign: fewer agents, smaller models for sub-tasks, or more aggressive summarization. An unbudgeted multi-agent system in production is a budget incident waiting to happen.
6.4 Observability: LangSmith vs. Phoenix vs. Datadog
You can’t optimize what you can’t see. Multi-agent systems require trace-level visibility into every agent call, tool invocation, and inter-agent message. Here’s how the three major options compare:
| Feature | LangSmith | Phoenix (Arize) | Datadog |
|---|---|---|---|
| Focus | LangChain/LangGraph native | Open-source LLM observability | General APM + LLM add-on |
| Setup | 2 env vars (LANGCHAIN_API_KEY + LANGCHAIN_TRACING_V2=true) | pip install arize-phoenix, run server | Datadog Agent + LLM integration |
| Agent traces | ✅ Native — full graph execution tree | ✅ OpenTelemetry-based spans | ⚠️ Requires manual instrumentation |
| TTFT | ✅ Time-to-first-token per span | ✅ Streaming latency metrics | ✅ Custom metrics |
| Inter-agent flow | ✅ Parent/child span tree per agent | ✅ Trace waterfall | ⚠️ Manual context propagation |
| Judge-LLM integration | ✅ LangSmith Evaluators (built-in) | ✅ Evals framework | ❌ Custom build required |
| Cost tracking | ✅ Per-run token + cost breakdown | ✅ Token cost dashboards | ⚠️ Custom metrics |
| Self-hosted | ❌ SaaS only (free tier available) | ✅ Fully self-hostable | ✅ Self-hosted option |
| Best for | LangGraph projects | Privacy-sensitive / on-prem | Teams already using Datadog |
TTFT (Time to First Token) is the most important latency metric in streaming agent systems — it measures how long the user waits before seeing the first response token. In a multi-agent pipeline, TTFT compounds across agents:
TTFT in Multi-Agent Context:
User submits goal
│
├── Orchestrator LLM call: TTFT₁ = 450ms first plan token arrives
├── Researcher tool call: TTFT₂ = 0ms (DB query, not LLM)
├── Analyst LLM call: TTFT₃ = 380ms first analysis token
├── Writer LLM call: TTFT₄ = 520ms first report token
└── User sees result: Total wall time = ~8.5s
Key insight: TTFT measures LLM responsiveness per agent call.
Total latency = sum of all agent calls + tool execution time.
You need per-agent TTFT to identify which agent is the bottleneck.
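Per-agent TTFT is straightforward to measure yourself: time the arrival of the first streamed chunk. A framework-agnostic sketch, where `fake_stream` is a stand-in for a real streaming LLM call:

```python
import time
from typing import Iterable, Iterator

def measure_ttft(stream: Iterable[str]) -> tuple[float, float, str]:
    """Consume a token stream; return (ttft_ms, total_ms, full_text)."""
    start = time.monotonic()
    ttft_ms = None
    chunks = []
    for chunk in stream:
        if ttft_ms is None:
            ttft_ms = (time.monotonic() - start) * 1000  # first token arrived
        chunks.append(chunk)
    total_ms = (time.monotonic() - start) * 1000
    return ttft_ms, total_ms, "".join(chunks)

def fake_stream() -> Iterator[str]:
    """Stand-in for a streaming LLM response: slow first token, fast rest."""
    time.sleep(0.05)  # model "thinking" before the first token
    yield "Revenue "
    for tok in ("grew ", "7.2% ", "in ", "Q4."):
        time.sleep(0.005)
        yield tok

ttft, total, text = measure_ttft(fake_stream())
print(f"TTFT: {ttft:.0f}ms, total: {total:.0f}ms")
```

Wrap each agent's streaming call in a helper like this and tag the result with the agent name, and you have the per-agent breakdown the paragraph above calls for, even without a full observability stack.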
# Phoenix (Arize) setup — self-hostable, open-source
import phoenix as px
from openinference.instrumentation.langchain import LangChainInstrumentor

# Start Phoenix server (local or remote)
px.launch_app()  # opens at http://localhost:6006

# Auto-instrument all LangChain/LangGraph calls
LangChainInstrumentor().instrument()
# Every agent call now emits spans with:
# - agent name (from node name in LangGraph)
# - TTFT (time_to_first_token attribute)
# - total tokens (input + output)
# - latency (elapsed_ms)
# - tool calls made within the span
# - cost estimate
# ─── LangSmith setup — 2 lines ───────────────────────────
import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "lsv2_..."
# Now every LangGraph run is traced automatically.
# View at: https://smith.langchain.com
# ─── Custom alerting on TTFT + cost anomalies ────────────
from dataclasses import dataclass

@dataclass
class AgentSLOs:
    max_ttft_ms: int = 1000       # alert if first token > 1s
    max_total_ms: int = 30000     # alert if total agent call > 30s
    max_cost_usd: float = 0.50    # alert if single agent call > $0.50

def check_agent_slos(span_data: dict, slos: AgentSLOs) -> list[str]:
    alerts = []
    if span_data["ttft_ms"] > slos.max_ttft_ms:
        alerts.append(f"⚠️ TTFT exceeded: {span_data['ttft_ms']}ms > {slos.max_ttft_ms}ms")
    if span_data["total_ms"] > slos.max_total_ms:
        alerts.append(f"⚠️ Latency exceeded: {span_data['total_ms']}ms")
    if span_data["cost_usd"] > slos.max_cost_usd:
        alerts.append(f"💸 Cost spike: ${span_data['cost_usd']:.3f} for one call")
    return alerts
🔧 Engineer’s Note: Start with LangSmith if you use LangGraph — it’s zero-config. Set two environment variables and every agent call, tool invocation, TTFT measurement, and cost is traced automatically. Move to Phoenix when you need self-hosting (financial data that can’t touch third-party SaaS) or when you want to run your Judge-LLM evaluations in the same observability platform. Datadog is the right choice if your company already runs Datadog for infrastructure — adding LLM-specific metrics to an existing Datadog setup is far easier than introducing a new observability stack.
7. Agent Evaluation: The Judge-LLM Pattern
The hardest question in multi-agent systems: how do you know the system actually did a good job?
7.1 The Evaluation Problem
Why Multi-Agent Output Is Hard to Evaluate:
Single-tool output:
"SELECT COUNT(*) FROM customers"
→ Returns: 2,847
→ Evaluation: easy (compare to ground truth)
Multi-agent audit report:
→ Accuracy: Are all numbers correct?
→ Completeness: Did it miss any anomalies?
→ Faithfulness: Are conclusions based on actual data?
→ Clarity: Is the report understandable to a non-technical reader?
→ Actionability: Are recommendations specific and implementable?
→ Human evaluation: accurate but slow and expensive
→ Rule-based evaluation: fast but brittle (can't handle nuance)
→ Judge-LLM evaluation: best compromise for production
7.2 The Judge-LLM Pattern
A Judge-LLM is a separate LLM call whose only job is to evaluate another LLM’s output:
from langchain_anthropic import ChatAnthropic
from pydantic import BaseModel, Field

class AuditReportEvaluation(BaseModel):
    """Structured evaluation of an audit report."""
    correctness: int = Field(ge=1, le=5, description="Are all numbers accurate?")
    completeness: int = Field(ge=1, le=5, description="Were all anomalies identified?")
    faithfulness: int = Field(ge=1, le=5, description="Are conclusions based on provided data only?")
    clarity: int = Field(ge=1, le=5, description="Is the report clear and professional?")
    actionability: int = Field(ge=1, le=5, description="Are recommendations specific and actionable?")
    issues: list[str] = Field(description="Specific problems found")
    overall: int = Field(ge=1, le=5, description="Overall quality score")
    verdict: str = Field(description="PASS (≥4.0 avg) or FAIL with explanation")

    @property
    def average_score(self) -> float:
        return (self.correctness + self.completeness + self.faithfulness +
                self.clarity + self.actionability) / 5

def judge_audit_report(
    report: str,
    source_data: str,
    expected_anomalies: list[str],
) -> AuditReportEvaluation:
    """Use a powerful LLM to evaluate the audit report quality."""
    judge_llm = ChatAnthropic(
        model="claude-opus-4-20250514"  # use a LARGER model as judge
    ).with_structured_output(AuditReportEvaluation)

    judge_prompt = f"""You are a senior audit partner reviewing an AI-generated financial report.
Evaluate this report against the source data and known anomalies.

SOURCE DATA (ground truth):
{source_data[:3000]}

KNOWN ANOMALIES TO DETECT:
{chr(10).join(f"- {a}" for a in expected_anomalies)}

REPORT TO EVALUATE:
{report[:4000]}

Score each dimension 1-5:
- 5: Excellent, no issues
- 4: Good, minor issues
- 3: Acceptable, notable gaps
- 2: Poor, significant issues
- 1: Unacceptable, critical failures

Be rigorous. The report will be shown to the board."""

    return judge_llm.invoke(judge_prompt)

# Usage in the pipeline:
evaluation = judge_audit_report(
    report=state["final_report"],
    source_data=state["raw_data"],
    expected_anomalies=["Q3 COGS spike 23%", "Q2 margin compression"],
)

if evaluation.average_score < 4.0:
    print(f"❌ Report FAILED (score: {evaluation.average_score:.1f}/5)")
    print(f"Issues: {evaluation.issues}")
    # Send back for revision
else:
    print(f"✅ Report PASSED (score: {evaluation.average_score:.1f}/5)")
    # Proceed to HITL approval
7.3 Judge-LLM Best Practices
Judge-LLM Design Rules:
1. LARGER MODEL AS JUDGE
Judge model should be equal or larger than the agent model.
Don't use Claude Haiku to judge Claude Sonnet's work.
Judge: Claude Opus / GPT-4o
Agent: Claude Sonnet / GPT-4o mini
2. STRUCTURED OUTPUT
Use Pydantic models to force the judge to score each dimension.
Unstructured "this is good/bad" is hard to act on programmatically.
3. REFERENCE DATA
Always include: ground truth data, expected outputs, known edge cases.
A judge without reference data is just an opinion machine.
4. CALIBRATE WITH HUMAN SPOT-CHECKS
Run judge evaluations in parallel with human reviews for 2-4 weeks.
Measure judge-human agreement rate (target: >80% agreement).
Adjust judge prompt if agreement is low.
5. FAIL FAST IN PIPELINES
Use judge as a gate, not a final report.
If score < threshold → retry → if still fails → escalate to human.
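Rule 5 above (judge as a gate with bounded retries and human escalation) reduces to a plain control loop. A sketch where `run_agents` and `judge` are stand-ins for your pipeline and your Judge-LLM call:

```python
from typing import Callable

def judged_pipeline(run_agents: Callable[[str], str],
                    judge: Callable[[str], float],
                    threshold: float = 4.0,
                    max_retries: int = 2) -> tuple[str, str]:
    """Run → judge → retry up to max_retries → escalate to a human."""
    feedback = ""
    report = ""
    for attempt in range(max_retries + 1):
        report = run_agents(feedback)
        score = judge(report)
        if score >= threshold:
            return "PASS", report
        feedback = f"Previous attempt scored {score:.1f}/5; revise."
    return "ESCALATE_TO_HUMAN", report  # never loop forever

# Toy demo: an agent that improves each retry but never reaches 4.0
scores = iter([3.0, 3.5, 3.8])
status, _ = judged_pipeline(
    run_agents=lambda fb: f"report ({fb or 'first draft'})",
    judge=lambda r: next(scores),
)
print(status)  # the judge never clears 4.0, so the run escalates
```

The bounded loop is the point: without `max_retries`, a strict judge plus a weak agent is an infinite (and expensive) revision cycle.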
🔧 Engineer’s Note: Judge-LLM is not perfect, but it’s 100× better than no evaluation. The alternative — running multi-agent systems in production with no quality checks — is how you ship wrong numbers to your board. Judge-LLM catches the “obvious” failures (missing data, mathematical errors, hallucinated recommendations) automatically. Human reviewers can then focus on the ambiguous cases the judge flags with low confidence.
8. Challenges & Anti-Patterns
Multi-agent systems fail in predictable ways. Learn to recognize these patterns before they cost you.
8.1 The Politeness Loop (Agent Communication Deadlock)
The most insidious multi-agent failure: agents are excessively deferential to each other, leading to an infinite loop of non-action.
The Politeness Loop:
Writer: "Here is the draft report. Please review."
Reviewer: "Thank you! The draft looks good overall. Any specific areas
you'd like me to focus on?"
Writer: "Thank you for asking! Please focus on whatever you think
is most important."
Reviewer: "Of course! Is there any section you're less confident about?"
Writer: "That's a great question. Maybe you could review the anomalies
section? But really, any feedback is welcome."
Reviewer: "Absolutely! I'll take a comprehensive look. Anything else to
consider before I begin?"
Writer: "Not really! Please review at your discretion."
...
[Conversation ends after max_rounds limit is hit. Nothing was reviewed.]
Prevention:
# Force directional conversation with explicit task requirements
writer_handoff = Task(
    description="""Review the attached draft. Your response MUST contain:
1. APPROVED (if all five quality criteria are met), OR
2. REVISION REQUIRED: [specific list of changes needed]

Do NOT ask clarifying questions. Do NOT express uncertainty.
Make a definitive judgment based on the provided criteria.""",
    expected_output="Either 'APPROVED' or 'REVISION REQUIRED: [numbered list]'",
    agent=reviewer,
)
8.2 The Echo Chamber
Multiple agents agree with each other without critical evaluation, leading to reinforced (possibly wrong) conclusions.
Echo Chamber Pattern:
Researcher: "Revenue grew 7.2% in Q4."
Analyst: "Confirmed — Q4 revenue growth of 7.2% is significant."
Writer: "Strong Q4 performance with 7.2% revenue growth."
Reviewer: "Report accurately reflects Q4's 7.2% revenue growth."
[No agent noticed that 7.2% growth on a declining baseline
actually means the business is still shrinking vs. 2 years ago.]
Prevention:
→ Give each agent different "lenses" in their system prompts:
Analyst: "Be skeptical. Find what's wrong or missing."
Reviewer: "Play devil's advocate. What would critics say?"
→ Add a dedicated "steelman and devil's advocate" agent role.
8.3 Context Amnesia in Long Pipelines
By the time information reaches the 4th or 5th agent in a pipeline, critical context from earlier agents may be lost or distorted.
Context Loss Across Agents:
Step 1 (Researcher): "Q3 COGS increased 23% due to a single large
one-time purchase of raw materials on 2024-09-15."
[Researcher passes a summary, not full detail]
Step 2 (Analyst): "Q3 COGS anomaly detected: +23% variance."
[Loses the "one-time" and "raw materials" context]
Step 3 (Writer): "Q3 shows a concerning cost trend with COGS up 23%."
[Now it sounds like a recurring problem, not a one-time event]
Step 4 (Reviewer): "Report correctly identifies the Q3 cost increase."
[Approves the now-misleading characterization]
Board: "Why are our costs trending up?"
CFO: "They're not — it was a one-time purchase!"
Prevention:
→ Include full anomaly details in every handoff, not just numbers
→ Use structured handoff templates that preserve context
→ Run a final "consistency check" agent that compares
final report to original source data
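A structured handoff template is the simplest of these fixes: force every agent to pass the fields that keep getting dropped, instead of a free-text paraphrase. A sketch using the Q3 COGS example; the field names are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class AnomalyHandoff:
    """Structured handoff: the next agent gets full context, not a paraphrase."""
    metric: str                  # e.g. "Q3 COGS"
    change_pct: float            # e.g. +23.0
    is_one_time: bool            # the fact that got lost in the example above
    root_cause: str              # e.g. "single raw-materials purchase"
    evidence_date: str           # e.g. "2024-09-15"
    notes: list[str] = field(default_factory=list)

    def render_for_next_agent(self) -> str:
        kind = "ONE-TIME event" if self.is_one_time else "recurring trend"
        return (f"{self.metric} changed {self.change_pct:+.0f}% ({kind}). "
                f"Cause: {self.root_cause} ({self.evidence_date}).")

h = AnomalyHandoff("Q3 COGS", 23.0, True,
                   "single raw-materials purchase", "2024-09-15")
print(h.render_for_next_agent())
```

Because `is_one_time` is a required field, the researcher cannot omit it, and the writer cannot accidentally turn a one-off purchase into a "concerning cost trend".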
8.4 The Star Agent Problem
One agent becomes a bottleneck — everything routes through it, creating high load, high cost, and a single point of failure.
Star Pattern (anti-pattern):
All agents → Orchestrator → All agents
Orchestrator must:
- Route every inter-agent message
- Summarize and re-contextualize between agents
- Handle all errors and re-routing
→ Becomes a bottleneck as agent count grows
→ Single LLM with enormous context window
→ Very expensive and slow
Better: Mesh with direct agent-to-agent handoffs for sequential steps:
Researcher → Analyst → Writer → Reviewer
Orchestrator only handles routing exceptions and HITL decisions.
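The mesh alternative can be as simple as chaining agent callables directly and reserving the orchestrator for exceptions. A framework-free sketch; the agent functions are stand-ins for real agent calls:

```python
from typing import Callable

Agent = Callable[[str], str]

def run_mesh(pipeline: list[Agent], goal: str,
             on_error: Callable[[str, Exception], str]) -> str:
    """Sequential direct handoffs; the orchestrator only sees exceptions."""
    payload = goal
    for agent in pipeline:
        try:
            payload = agent(payload)          # direct agent-to-agent handoff
        except Exception as exc:
            payload = on_error(payload, exc)  # orchestrator: exceptions only
    return payload

# Stand-in agents: each transforms the payload and hands it on
def researcher(p: str) -> str: return p + " | data"
def analyst(p: str) -> str:    return p + " | anomalies"
def writer(p: str) -> str:     return p + " | draft"
def reviewer(p: str) -> str:   return p + " | approved"

result = run_mesh([researcher, analyst, writer, reviewer], "audit Q4",
                  on_error=lambda p, e: p + " | escalated")
print(result)  # → audit Q4 | data | anomalies | draft | approved
```

Compared with the star pattern, no tokens are spent re-summarizing every handoff through a central LLM; the orchestrator's context only grows when something goes wrong.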
8.5 Anti-Pattern Summary
| Anti-Pattern | Symptom | Fix |
|---|---|---|
| Politeness Loop | Agents ask questions instead of completing tasks | Force structured outputs, add “Do NOT ask” to prompts |
| Echo Chamber | Agents reinforce each other’s errors | Give each agent a skeptical lens, add devil’s advocate role |
| Context Amnesia | Later agents miss critical early context | Use structured handoffs, compare final vs. source data |
| Star Bottleneck | Orchestrator handles everything | Direct agent handoffs for sequential steps |
| Uncapped Cost | Token bill explodes | Budget manager, per-agent limits, summarize handoffs |
| Silent Failure | Agent “completes” task with wrong output | Judge-LLM evaluation gate, structured output schemas |
🔧 Engineer’s Note: The root cause of most multi-agent failures is treating agents as reliable, rational employees. They’re not — they’re probabilistic systems with limited context that can be confused by ambiguity, misled by previous context, and exhausted by long conversation histories. Design your system to be robust to agent errors, not optimistic that errors won’t happen. Every handoff should include failure handling. Every review cycle should have a maximum revision count with a human escalation path.
9. Key Takeaways
9.1 When to Use Multi-Agent
Multi-Agent Decision Framework:
Does the task require 10+ distinct steps? → Maybe
Are there clearly distinct expert roles needed? → Maybe
Can steps run in parallel to save time? → Maybe
Is quality review of AI output critical (finance)? → Maybe
ALL of the above are true? → Multi-agent is justified
SOME of the above are true? → Start with single agent + more tools
NONE of the above are true? → Definitely single agent
Budget is > 5× expected single-agent cost?
→ Only proceed if the quality/speed gain justifies it
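The checklist above fits in a tiny helper. The criterion names are paraphrases of the four questions; the thresholds are the article's rule of thumb:

```python
def multi_agent_verdict(distinct_steps_10plus: bool,
                        distinct_expert_roles: bool,
                        parallelizable: bool,
                        critical_review_needed: bool) -> str:
    """ALL criteria → multi-agent; SOME → single agent + tools; NONE → single agent."""
    criteria = [distinct_steps_10plus, distinct_expert_roles,
                parallelizable, critical_review_needed]
    if all(criteria):
        return "multi-agent justified"
    if any(criteria):
        return "start with single agent + more tools"
    return "definitely single agent"

print(multi_agent_verdict(True, True, True, True))    # → multi-agent justified
print(multi_agent_verdict(True, False, True, False))  # → start with single agent + more tools
print(multi_agent_verdict(False, False, False, False))  # → definitely single agent
```

Even when all four answers are "yes", the budget question still applies: run the cost model from §6 before committing.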
9.2 Summary Table
| Topic | Key Principle |
|---|---|
| Supervisor pattern | Best for most production workflows — clear routing, explicit control |
| Blackboard (LangGraph) | State flows through nodes — each agent reads previous results |
| A2A vs. MCP | MCP: agent-to-tools (vertical), A2A: agent-to-agent (horizontal) |
| Cost explosion | 3 agents ≈ 3× the tokens of one, and the multiplier grows super-linearly. Budget before you build |
| Smaller models for sub-tasks | Route cheap tasks (research, summarization) to smaller, cheaper models |
| Judge-LLM | Evaluate outputs automatically. Larger judge model than agent model |
| Politeness loop | Force structured outputs. “Do NOT ask clarifying questions.” |
| Context amnesia | Include full context in handoffs. Compare final report to source data |
| Max review cycles | 2-3 revisions max, then human escalation. Never unlimited |
9.3 The Architecture Evolution
AI 01 → Prompt Engineering (how to talk to LLMs)
AI 02 → Dev Tools (LLMs that write code)
AI 03 → RAG (LLMs that read private data)
AI 04 → MCP (universal tool connections)
AI 05 → Agents (LLMs that act autonomously)
AI 06 → Multi-Agent (agent teams that collaborate) ← NOW
↑ specialization + parallelization + cross-system coordination
AI 07 → Security (making all of the above safe)
AI 08 → Financial Application (putting it all together)
9.4 Bridge to AI 07
You now know how to build systems where multiple AI agents collaborate, each with their own tools, memory, and expertise. These systems are powerful — and that power creates risk.
What happens when a malicious user sends a prompt that the Researcher agent passes to the Analyst agent, which passes it to the Writer agent, which then includes it in a report that gets sent to the board? That’s an indirect prompt injection attack — and it’s trivially easy in a multi-agent pipeline.
In AI 07, we put guardrails on everything we’ve built: prompt injection defenses, output sanitization, data privacy controls, red teaming, and the Defense-in-Depth architecture that makes AI systems production-safe.
This is AI 06 of a 12-part series on production AI engineering. Continue to AI 07: AI Security — Defending the Probabilistic Attack Surface.