
Multi-Agent Systems: When One Brain Isn't Enough

AI Multi-Agent LangGraph CrewAI AutoGen A2A LLMOps Agent Orchestration

In AI 05, we built a single agent: one LLM, one loop, one goal. It could query databases, calculate figures, and generate reports. But some problems are too complex for one agent — just as some projects are too large for one engineer.

A financial audit requires a researcher who pulls the data, an analyst who identifies anomalies, a writer who drafts the report, and a reviewer who validates the conclusions. These are different competencies — different tools, different system prompts, different specializations. Cramming all of them into a single agent creates a generalist that does everything adequately but nothing excellently.

Multi-Agent Systems split complex tasks across specialized agents that collaborate. The architecture mirrors how high-performance human teams work: clear roles, defined handoffs, and a coordinator who keeps the whole operation aligned.

TL;DR: Multi-agent systems solve the specialization and parallelization problems that single agents can’t. One orchestrator coordinates multiple specialist agents — each with their own tools, prompts, and expertise. This article covers the four collaboration patterns, how to build them with LangGraph, CrewAI, and AutoGen, the critical cost explosion problem, and the anti-patterns that turn agent teams into chaos.

⚠️ Freshness Warning: Multi-agent frameworks evolve rapidly. This article focuses on architecture patterns that remain stable. Verify framework-specific APIs against current documentation.

┌──────────────────────────────────────────────────────────┐
│              Supervisor Pattern (most common)             │
│                                                          │
│             ┌──────────────┐                             │
│             │ Orchestrator │ ← receives goal, routes     │
│             │   (Agent 0)  │   tasks, collects results   │
│             └──────┬───────┘                             │
│          ┌─────────┼─────────┐                           │
│   ┌──────▼───┐ ┌───▼─────┐ ┌─▼───────┐                   │
│   │Researcher│ │ Analyst │ │Reviewer │                   │
│   │(Agent 1) │ │(Agent 2)│ │(Agent 3)│                   │
│   │tools:    │ │tools:   │ │tools:   │                   │
│   │web_search│ │calc     │ │validate │                   │
│   │db_query  │ │chart    │ │send_msg │                   │
│   └──────────┘ └─────────┘ └─────────┘                   │
│                                                          │
│  Each agent: own system prompt, own tools, own LLM       │
│  Orchestrator: routes tasks, enforces HITL approvals     │
└──────────────────────────────────────────────────────────┘

Article Map

I — Theory Layer (why multi-agent?)

  1. Why Multi-Agent? — The specialization argument
  2. Collaboration Patterns — Supervisor, P2P, Hierarchical, Blackboard

II — Architecture Layer (how it’s built)

  3. Communication Protocols — Messages, handoffs, shared memory
  4. Agent-to-Agent (A2A) Protocol — Google’s inter-agent standard

III — Engineering Layer (building and operating)

  5. Building Multi-Agent Systems — LangGraph, CrewAI, AutoGen
  6. The Cost Explosion Problem — Super-linear scaling, budgets
  7. Agent Evaluation: The Judge-LLM Pattern — Automated quality control
  8. Challenges & Anti-Patterns — What kills production systems
  9. Key Takeaways — Decision framework


1. Why Multi-Agent? The Specialization Argument

1.1 The Single-Agent Ceiling

A single agent has three fundamental limitations:

Single-Agent Limitations:

  1. CONTEXT WINDOW CEILING
     All tools, all history, all state must fit in one context window.
     A financial audit agent needs:
       - DB tools (query, insert, validate)
       - Web tools (search, fetch)
       - Communication tools (email, Slack)
       - Document tools (PDF, Excel)
       - 50+ steps of history
     → Total: 150K+ tokens → expensive and error-prone

  2. COGNITIVE OVERLOAD
     One LLM tries to be researcher, analyst, writer, AND reviewer.
     "Jack of all trades, master of none."
     → Each role needs different system prompts
     → Switching roles within one prompt degrades quality

  3. SEQUENTIAL EXECUTION
     All steps must happen in one long chain.
     → Can't parallelize: research + analysis + review simultaneously
     → A 3-hour job stays 3 hours instead of 1 hour with 3 agents

1.2 The Multi-Agent Solution

Multi-Agent vs. Single Agent:

  SINGLE AGENT:
  User Goal → [One LLM handles everything: research + analyze + write + review]
              → Long chain, large context, sequential, fragile

  MULTI-AGENT:
  User Goal
      │
      ▼
  Orchestrator: "This needs research first."
      │
      ├──→ Researcher Agent:  web_search + db_query → raw data
      ├──→ Analyst Agent:     calculate + chart → insights (parallel!)
      ├──→ Writer Agent:      draft report → formatted output (uses insights)
      └──→ Reviewer Agent:    validate + ⏸ HITL → approved final

  Each agent:
  - Smaller context window → cheaper per call
  - Focused system prompt → better quality
  - Specialized tools → right tool for each job
  - Can run in parallel → faster end-to-end
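The parallelism point can be sketched with plain asyncio; the agent bodies here are placeholder stubs standing in for real LLM calls:

```python
import asyncio

# Placeholder stubs standing in for real agent invocations.
async def researcher(goal: str) -> str:
    await asyncio.sleep(0.01)      # simulates a slow LLM/tool call
    return f"raw data for: {goal}"

async def fact_checker(goal: str) -> str:
    await asyncio.sleep(0.01)
    return f"sources verified for: {goal}"

async def run_independent_agents(goal: str) -> dict:
    # Agents with no data dependency on each other run concurrently,
    # instead of stacking up in one long sequential chain.
    raw, checks = await asyncio.gather(researcher(goal), fact_checker(goal))
    return {"raw_data": raw, "fact_check": checks}

results = asyncio.run(run_independent_agents("Q4 audit"))
```

Two 10 ms agents finish in roughly 10 ms total rather than 20 ms; the same shape applies when each call takes tens of seconds.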

Connection to AI 00 §6.3: This mirrors the Transformer’s Multi-Head Attention mechanism. In a Transformer, multiple attention heads simultaneously attend to different aspects of the input — one head for syntax, another for semantics, another for long-range dependencies. Multi-agent systems apply the same principle to task execution: multiple specialized agents simultaneously attend to different aspects of a complex problem.

1.3 When Multi-Agent Is (and Isn’t) Worth It

| Condition | Decision |
|-----------|----------|
| Task requires 10+ steps | ✅ Consider multi-agent |
| Task needs clearly distinct expertise (research vs. analysis vs. review) | ✅ Use multi-agent |
| Steps can run in parallel | ✅ Strong case for multi-agent |
| Quality review of AI output is critical | ✅ Use a dedicated reviewer agent |
| Task is simple and well-defined | ❌ Single agent is cheaper/simpler |
| Fewer than 3 distinct roles needed | ❌ Single agent with more tools |
| Budget is tight | ❌ Multi-agent costs 3-5× more (see §6) |
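As a rough heuristic, the decision rows above can be encoded in a few lines (the thresholds are illustrative, lifted directly from the table):

```python
def recommend_architecture(num_steps: int, distinct_roles: int,
                           parallelizable: bool, budget_tight: bool) -> str:
    """Illustrative decision heuristic mirroring the table above."""
    if budget_tight or distinct_roles < 3:
        return "single-agent"            # cheaper, simpler
    if num_steps >= 10 or parallelizable:
        return "multi-agent"             # specialization pays off
    return "single-agent"

choice = recommend_architecture(num_steps=15, distinct_roles=4,
                                parallelizable=True, budget_tight=False)
```

Note the ordering: the disqualifiers (budget, too few roles) are checked before the qualifiers, matching the table's priority.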

2. Four Collaboration Patterns

Multi-agent systems don’t all work the same way. The collaboration pattern determines how agents communicate and coordinate:

2.1 Supervisor Pattern (Orchestrator + Workers)

The most common pattern. A central orchestrator routes tasks to specialist workers and assembles the results.

Supervisor Pattern:

                    ┌────────────────────┐
                    │    ORCHESTRATOR    │
                    │                    │
                    │  Receives goal     │
                    │  Decomposes task   │
                    │  Routes to agents  │
                    │  Collects results  │
                    │  Returns final     │
                    └─────────┬──────────┘
             ┌────────────────┼────────────────┐
             │                │                │
       ┌─────▼──────┐  ┌──────▼──────┐  ┌─────▼──────┐
       │  Worker A  │  │  Worker B   │  │  Worker C  │
       │ Researcher │  │  Analyst    │  │  Reviewer  │
       │            │  │             │  │            │
       │ web_search │  │ calculate   │  │ validate   │
       │ db_query   │  │ chart_gen   │  │ send_email │
        └────────────┘  └─────────────┘  └────────────┘

  Communication flow:
  Orchestrator → Worker A: "Research Q4 revenue data"
  Worker A → Orchestrator: [raw data]
  Orchestrator → Worker B: "Analyze this data" + [raw data]
  Worker B → Orchestrator: [insights + charts]
  Orchestrator → Worker C: "Review and approve" + [draft + insights]
  Worker C → Orchestrator: [approved report] OR [revision requests]
  Orchestrator → User: [final report]

Best for: Sequential workflows with clear handoffs. Financial reporting, content creation pipelines, code review workflows.

Weakness: Single point of failure — if the orchestrator makes a wrong routing decision, the whole pipeline suffers.
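Stripped of any framework, the routing loop behind this pattern is small; the worker functions below are placeholders standing in for real agent calls:

```python
# Placeholder workers: each reads the shared state and adds its output.
def researcher(state: dict) -> dict:
    return {**state, "raw_data": "Q4 revenue rows"}

def analyst(state: dict) -> dict:
    return {**state, "analysis": "margins stable, COGS spike in Q3"}

def reviewer(state: dict) -> dict:
    return {**state, "approved": True}

WORKERS = {"researcher": researcher, "analyst": analyst, "reviewer": reviewer}

def orchestrate(goal: str, max_steps: int = 10) -> dict:
    """Route to the next worker based on what the state still lacks."""
    state = {"goal": goal}
    for _ in range(max_steps):             # guardrail against infinite loops
        if "raw_data" not in state:
            next_worker = "researcher"
        elif "analysis" not in state:
            next_worker = "analyst"
        elif not state.get("approved"):
            next_worker = "reviewer"
        else:
            return state                   # all stages complete
        state = WORKERS[next_worker](state)
    raise RuntimeError("orchestrator exceeded its step budget")

final = orchestrate("Q4 financial audit")
```

In production the `if/elif` chain is usually an LLM routing decision, which is exactly where the wrong-routing weakness above creeps in.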

2.2 Peer-to-Peer (Decentralized)

Agents communicate directly with each other, without a central coordinator. Each agent decides who to talk to next.

Peer-to-Peer Pattern:

  ┌──────────┐     ┌──────────┐     ┌──────────┐
  │  Agent A │────→│  Agent B │────→│  Agent C │
  │Researcher│     │Analyst   │     │Writer    │
  └──────────┘     └──────────┘     └──────────┘
        ↑                                  │
        └──────────────────────────────────┘
                  (feedback loop)

  Each agent decides:
  "I've completed my part. Who needs my output?"
  "I need X. Who can provide it?"

Best for: Creative workflows where the “next step” isn’t always predetermined. Brainstorming, exploration, R&D.

Weakness: Hard to audit, hard to debug, easy to create communication loops. Not recommended for production systems that require explainability.

2.3 Hierarchical (Manager → Team → Worker)

Multi-level supervision. A top-level orchestrator manages mid-level managers, who manage specialist workers.

Hierarchical Pattern:

  ┌─────────────────────────────────────────────────┐
  │           VP-LEVEL ORCHESTRATOR                  │
  │   "Prepare the monthly board package"            │
  └───────────┬─────────────────────┬───────────────┘
              │                     │
    ┌─────────▼──────┐    ┌─────────▼──────┐
    │  FINANCE TEAM  │    │ STRATEGY TEAM  │
    │   Manager      │    │   Manager      │
    └──┬──────┬──────┘    └──┬──────┬──────┘
       │      │              │      │
   ┌───▼──┐ ┌─▼────┐    ┌───▼──┐ ┌─▼────┐
   │Rev   │ │COGS  │    │Mkt   │ │Risk  │
   │Agent │ │Agent │    │Agent │ │Agent │
   └──────┘ └──────┘    └──────┘ └──────┘

  → Scales to large, complex domains
  → Each "team" has its own specialized context
  → Manager agents aggregate and synthesize sub-team results

Best for: Enterprise-scale workflows. Monthly close automation, due diligence, large codebase refactoring.

Weakness: High overhead (many agents = many tokens), complex orchestration, expensive to debug.

2.4 Blackboard Pattern (Shared Workspace)

All agents read and write to a shared “blackboard” (a structured state object). No direct agent-to-agent messages — coordination happens through the shared state.

Blackboard Pattern:

  ┌─────────────────────────────────────────────┐
  │              BLACKBOARD (Shared State)       │
  │                                             │
  │  goal:        "Q4 financial audit"          │
  │  raw_data:    [query results...]            │
  │  analysis:    {insights: [...]}             │
  │  anomalies:   ["Q3 COGS spike +23%"]       │
  │  draft:       "Q4 saw strong revenue..."   │
  │  review_status: "approved"                  │
  │  final_report: "..."                        │
  └──────┬──────────────┬──────────────┬────────┘
         │              │              │
    ┌────▼────┐   ┌─────▼───┐   ┌─────▼───┐
    │Researcher│   │Analyst  │   │Writer   │
    │reads:    │   │reads:   │   │reads:   │
    │ goal     │   │raw_data │   │analysis │
    │writes:   │   │writes:  │   │writes:  │
    │ raw_data │   │analysis │   │ draft   │
    └──────────┘   └─────────┘   └─────────┘

  LangGraph implements this naturally — AgentState IS the blackboard.
  Each node reads from and writes to the shared state dict.

Best for: LangGraph-based systems, workflows where agents naturally sequence by reading previous results.

Weakness: Requires careful state schema design upfront; complex dependency management.

🔧 Engineer’s Note: In practice, most production multi-agent systems combine patterns. A supervisor at the top level routes tasks (Supervisor pattern), sub-tasks share state through a blackboard (Blackboard pattern), and a dedicated Reviewer communicates directly with the Writer (limited P2P for feedback). Don’t feel constrained to pick exactly one pattern.


3. Communication Protocols: How Agents Talk

Coordination between agents requires a well-defined communication protocol. There are three mechanisms:

3.1 Message Passing

The simplest form: agents send structured messages to each other.

Message Structure:

  {
    "from":    "researcher_agent",
    "to":      "analyst_agent",
    "type":    "task_result",
    "content": {
      "data": [...query results...],
      "metadata": {
        "source": "PostgreSQL financials DB",
        "rows": 847,
        "query_time_ms": 234
      }
    },
    "timestamp": "2024-12-31T09:15:00Z",
    "task_id":   "monthly-audit-2024-12"
  }

Key design decisions:

  • Schema: Define strict schemas for messages — unstructured strings lead to misinterpretation
  • Routing: Who decides which agent receives the message? (Orchestrator vs. self-routing)
  • Failure handling: What happens when the recipient agent fails?
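The schema point can be enforced at construction time; a minimal sketch using a frozen dataclass, with field names following the JSON example above (`from`/`to` renamed because `from` is a Python keyword):

```python
from dataclasses import dataclass

ALLOWED_TYPES = {"task_request", "task_result", "error"}  # illustrative set

@dataclass(frozen=True)
class AgentMessage:
    sender: str       # "from" in the JSON example
    recipient: str    # "to"
    type: str
    content: dict
    task_id: str

    def __post_init__(self):
        # Fail fast instead of letting a downstream agent
        # misinterpret an unstructured string.
        if self.type not in ALLOWED_TYPES:
            raise ValueError(f"unknown message type: {self.type}")

msg = AgentMessage(
    sender="researcher_agent",
    recipient="analyst_agent",
    type="task_result",
    content={"data": [1, 2, 3], "metadata": {"rows": 3}},
    task_id="monthly-audit-2024-12",
)
```

A malformed message (say, `type="result"`) raises immediately at the sender, rather than surfacing as a confused reply three agents downstream.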

3.2 Shared State (Blackboard)

All agents read and write to a shared state object. No direct messages — coordination is implicit.

# In LangGraph: AgentState IS the shared blackboard
from typing import Optional, TypedDict

class TeamState(TypedDict):
    # Input
    goal: str
    
    # Intermediate results (written by specific agents, read by others)
    raw_data:     Optional[list]          # written by: researcher
    analysis:     Optional[dict]          # written by: analyst
    draft_report: Optional[str]           # written by: writer
    review_notes: Optional[list[str]]     # written by: reviewer
    
    # Control flow
    current_agent: str                    # which agent should act next
    revision_count: int                   # how many review cycles
    
    # Output
    final_report: Optional[str]
    approved: bool

3.3 Agent Handoffs

A more structured form of message passing: one agent explicitly hands control to another, along with all relevant context.

Handoff Pattern:

  Researcher → Analyst (handoff):
    "Here is the data I collected. Your task: identify anomalies.
    Context you need:
    - Q1-Q4 revenue: [data]
    - YoY comparison baseline: [data]
    - Industry benchmarks: [reference]
    Focus particularly on: Q3 COGS variance (23% spike).
    Return: your analysis in structured JSON."

  Why handoffs are better than raw message passing:
  ┌─────────────────┬─────────────────────────────────────┐
  │ Raw Message     │ Handoff                             │
  ├─────────────────┼─────────────────────────────────────┤
  │ "Here's data."  │ "Here's data + context + your goal  │
  │                 │  + what to focus on + expected       │
  │                 │  output format."                    │
  │ Recipient must  │ Recipient has everything it needs.  │
  │ infer context.  │ Less guesswork, fewer errors.       │
  └─────────────────┴─────────────────────────────────────┘

🔧 Engineer’s Note: The quality of agent handoffs directly determines the quality of the multi-agent system. A sloppy handoff (“here’s the data, do something with it”) forces the downstream agent to re-derive context it shouldn’t need to derive. A precise handoff (“here’s the data, here’s what I found interesting, here’s your specific task, here’s the output format I expect”) feeds the downstream agent exactly what it needs. Invest time designing handoff message templates.
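One way to codify that template is a small builder function (a sketch; the parameter names are illustrative, the output fields mirror the handoff example above):

```python
def build_handoff(task: str, context: dict, focus: str,
                  output_format: str) -> str:
    """Render a structured handoff message for the downstream agent."""
    context_lines = "\n".join(f"- {key}: {value}"
                              for key, value in context.items())
    return (
        f"Your task: {task}\n"
        f"Context you need:\n{context_lines}\n"
        f"Focus particularly on: {focus}\n"
        f"Return: {output_format}"
    )

handoff = build_handoff(
    task="identify anomalies in the Q1-Q4 data",
    context={"Q1-Q4 revenue": "[data]", "YoY baseline": "[data]"},
    focus="Q3 COGS variance (23% spike)",
    output_format="your analysis in structured JSON",
)
```

Forcing every handoff through one template makes the "sloppy handoff" failure mode impossible to commit silently: a missing focus or output format shows up as a missing argument.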


4. The A2A Protocol: Google’s Inter-Agent Standard

So far, we’ve discussed agent communication within a single system. But what about communication between systems — your company’s agent talking to a vendor’s agent, or a finance agent delegating to a specialized compliance agent from a different provider?

This is the problem that A2A (Agent-to-Agent) solves. Google announced the protocol in April 2025, with backing from Salesforce, SAP, MongoDB, Atlassian, and 50+ other partners.

4.1 The Problem A2A Solves

Without A2A:

  YourCompany.FinanceAgent ──proprietary format──→ Vendor.TaxAgent

       └── Custom integration code required for every agent pair
           N finance agents × M tax agents = N×M integrations

With A2A:

  YourCompany.FinanceAgent ──A2A protocol──→ Vendor.TaxAgent
                                A2A is universal
                                N + M implementations, not N × M

  The same N×M → N+M problem that MCP solved for tools,
  A2A solves for agent-to-agent communication.

4.2 A2A Core Concepts

A2A Architecture:

  AGENT CARD (Discovery)
  ─────────────────────
  Each A2A-capable agent publishes an "Agent Card" at a well-known URL:
  https://vendor.com/.well-known/agent.json

  {
    "name": "TaxCompliance Agent",
    "description": "Handles tax calculations, filings, and compliance checks",
    "url": "https://vendor.com/a2a/tax-agent",
    "capabilities": ["tax_calculation", "form_1099", "vat_europe"],
    "authentication": {"type": "oauth2"},
    "skills": [
      {"name": "calculate_tax", "description": "..."},
      {"name": "validate_filing", "description": "..."}
    ]
  }

  TASK-BASED INTERACTION
  ──────────────────────
  A2A uses tasks (not function calls) as the fundamental unit:
  {
    "task_id": "tax-calc-001",
    "message": "Calculate Q4 2024 federal tax for revenue $2.042M",
    "artifacts": [{ "type": "data", "content": [financial_data] }]
  }
  → Agent processes task asynchronously
  → Returns result when complete (not a synchronous call)

  STREAMING SUPPORT
  ─────────────────
  Long-running tasks can stream intermediate updates:
  Agent → "Task received, starting calculation" (10% progress)
         → "Federal calculation complete" (60% progress)
         → "State calculations complete" (90% progress)
         → "Final tax liability: $542,310" (100% done)
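Discovery then reduces to fetching and filtering Agent Cards. In this sketch the card JSON is inlined rather than fetched from the `.well-known` URL over HTTPS, and the helper function is a made-up illustration, not part of the A2A spec:

```python
import json

# Inlined for illustration; a real client would fetch this from
# https://<host>/.well-known/agent.json over HTTPS.
AGENT_CARD = """{
  "name": "TaxCompliance Agent",
  "url": "https://vendor.com/a2a/tax-agent",
  "capabilities": ["tax_calculation", "form_1099", "vat_europe"],
  "skills": [{"name": "calculate_tax", "description": "..."}]
}"""

def agents_with_capability(cards: list, capability: str) -> list:
    """Return endpoint URLs of agents advertising the capability."""
    return [card["url"] for card in cards
            if capability in card.get("capabilities", [])]

cards = [json.loads(AGENT_CARD)]
urls = agents_with_capability(cards, "tax_calculation")
```

The orchestrating agent picks an endpoint from `urls` and posts a task to it, rather than hard-coding an integration per vendor.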

4.3 MCP vs. A2A: Complementary Layers

How MCP (AI 04) and A2A Work Together:

  ┌─── YOUR AGENT ─────────────────────────────────────────┐
  │                                                        │
  │  Goal: "Complete Q4 tax filing"                        │
  │           │                                            │
  │           ├── READ local data via MCP                  │
  │           │   └→ MCP Server: PostgreSQL (financials)   │
  │           │      (AI 04 pattern)                       │
  │           │                                            │
  │           ├── DELEGATE calculation to specialist agent  │
  │           │   └→ A2A: TaxCompliance Agent (external)   │
  │           │      (AI 06 new pattern)                   │
  │           │                                            │
  │           └── SEND result via MCP tool                 │
  │               └→ MCP Server: Email (SMTP)              │
  │                  (AI 04 pattern)                       │
  │                                                        │
  └────────────────────────────────────────────────────────┘

  MCP:  Agent ↔ Tools/Data    (vertical — environment access)
  A2A:  Agent ↔ Agent         (horizontal — peer collaboration)
  They address different integration layers. Both are needed.

Connection to AI 04 §11.5: The Protocol Standards War between Anthropic (MCP) and Google (A2A) discussed in AI 04 is the strategic backdrop. In practice, production systems will likely use both: MCP for tool/data access during agent execution, and A2A for cross-system agent collaboration. Neither protocol eliminates the need for the other.


5. Building Multi-Agent Systems

Three major frameworks dominate multi-agent development. Each reflects a different philosophy:

5.1 Framework Comparison

| Framework | Philosophy | Best For | Control |
|-----------|------------|----------|---------|
| LangGraph | State machine — explicit graph of nodes and edges | Production, complex workflows, audit trails | High |
| CrewAI | Role-based — define “agents” like employees, “tasks” like job descriptions | Rapid prototyping, role-heavy workflows | Medium |
| AutoGen | Conversation-driven — agents converse until the task is complete | Research, exploratory multi-agent setups | Low |

5.2 LangGraph: Multi-Agent with Full Control

LangGraph extends naturally from single-agent (AI 05) to multi-agent by making the graph itself a supervisor that routes between agent sub-graphs.

# Multi-agent financial audit system in LangGraph
from langgraph.graph import StateGraph, END
from langgraph.checkpoint.memory import MemorySaver
from langchain_anthropic import ChatAnthropic
from langchain_core.messages import HumanMessage, SystemMessage, AIMessage
from typing import TypedDict, Annotated, Optional, Literal
from langgraph.graph.message import add_messages

# ─── Shared Team State (the Blackboard) ──────────────────
class AuditTeamState(TypedDict):
    messages:       Annotated[list, add_messages]
    goal:           str
    raw_data:       Optional[str]        # from Researcher
    analysis:       Optional[str]        # from Analyst
    draft_report:   Optional[str]        # from Writer
    review_notes:   Optional[list[str]]  # from Reviewer
    final_report:   Optional[str]
    approved:       bool
    revision_count: int
    next_agent:     str                  # routing control

# ─── Agent Factory ────────────────────────────────────────
def make_agent(role: str, system_prompt: str, tools: list):
    """Create a specialist agent with a specific role."""
    llm = ChatAnthropic(model="claude-sonnet-4-20250514")
    if tools:
        llm = llm.bind_tools(tools)
    
    def agent_fn(state: AuditTeamState) -> AuditTeamState:
        context = state.get("raw_data", "") or ""
        analysis = state.get("analysis", "") or ""
        draft = state.get("draft_report", "") or ""
        
        # Build role-specific prompt with available context
        user_msg = f"""Goal: {state['goal']}
        
Available context:
{f'Raw data: {context[:2000]}' if context else ''}
{f'Analysis: {analysis[:2000]}' if analysis else ''}
{f'Draft: {draft[:2000]}' if draft else ''}

Your task as {role}: Complete your part of this audit."""
        
        messages = [
            SystemMessage(content=system_prompt),
            HumanMessage(content=user_msg)
        ] + state["messages"][-3:]  # Last 3 messages for context
        
        response = llm.invoke(messages)
        return {"messages": [response]}
    
    return agent_fn

# ─── Define Specialist Agents ─────────────────────────────
researcher_agent = make_agent(
    role="Researcher",
    system_prompt="""You are a data researcher. Your job:
    1. Query the financial database for relevant data
    2. Always LIMIT queries to 50 rows (data truncation best practice)
    3. Return structured data with source metadata
    4. When done, write your findings to state""",
    tools=[query_database, fetch_schema]
)

analyst_agent = make_agent(
    role="Financial Analyst",
    system_prompt="""You are a financial analyst. Your job:
    1. Analyze the raw data provided
    2. Calculate growth rates, margins, variances
    3. Flag anomalies (variances > 10%)
    4. Return structured insights with confidence levels""",
    tools=[calculate, generate_chart]
)

writer_agent = make_agent(
    role="Report Writer",
    system_prompt="""You are a financial report writer. Your job:
    1. Transform analysis into a clear, professional report
    2. Structure: Executive Summary → Key Findings → Anomalies → Recommendations
    3. Use precise numbers and avoid vague language
    4. Format in Markdown""",
    tools=[generate_report]
)

reviewer_agent = make_agent(
    role="Quality Reviewer",
    system_prompt="""You are a quality reviewer. Your job:
    1. Verify all numbers are accurate and consistent
    2. Check that anomalies are properly flagged
    3. Ensure recommendations are actionable
    4. Return: 'APPROVED' OR specific revision requests""",
    tools=[validate_numbers]
)

# ─── Orchestrator Node ────────────────────────────────────
def orchestrator(state: AuditTeamState) -> AuditTeamState:
    """Routes tasks to the appropriate specialist agent."""
    
    # State-based routing logic
    if not state.get("raw_data"):
        return {"next_agent": "researcher"}
    
    if not state.get("analysis"):
        return {"next_agent": "analyst"}
    
    if not state.get("draft_report"):
        return {"next_agent": "writer"}
    
    if not state.get("approved", False):
        if state.get("revision_count", 0) >= 3:
            # Max revisions reached — escalate to human
            return {"next_agent": "human_review"}
        return {"next_agent": "reviewer"}
    
    return {"next_agent": "end"}

# ─── Researcher Node (with state update) ──────────────────
def run_researcher(state: AuditTeamState) -> AuditTeamState:
    result = researcher_agent(state)
    # Extract data from agent's response
    data = extract_data_from_response(result["messages"][-1])
    return {**result, "raw_data": data}

# ─── Reviewer Node (handles approval logic) ───────────────
def run_reviewer(state: AuditTeamState) -> AuditTeamState:
    result = reviewer_agent(state)
    response_text = result["messages"][-1].content
    
    if "APPROVED" in response_text.upper():
        return {**result, "approved": True, "final_report": state["draft_report"]}
    else:
        # Extract revision notes and send back to writer
        notes = parse_revision_notes(response_text)
        return {
            **result,
            "review_notes": notes,
            "draft_report": None,        # Clear draft for rewrite
            "revision_count": state.get("revision_count", 0) + 1,
        }

# ─── Routing function ──────────────────────────────────────
def route_after_orchestrator(state: AuditTeamState) -> str:
    return state.get("next_agent", "end")

# ─── Build the Graph ──────────────────────────────────────
audit_workflow = StateGraph(AuditTeamState)

# ─── Analyst / Writer Nodes ───────────────────────────────
# Same wrapper pattern as run_researcher. (A naive lambda like
# `lambda s: {**analyst_agent(s), "analysis": extract_analysis(analyst_agent(s)...)}`
# would invoke the agent twice per step — two LLM calls for one node.)
def run_analyst(state: AuditTeamState) -> AuditTeamState:
    result = analyst_agent(state)
    return {**result, "analysis": extract_analysis(result["messages"][-1])}

def run_writer(state: AuditTeamState) -> AuditTeamState:
    result = writer_agent(state)
    return {**result, "draft_report": extract_draft(result["messages"][-1])}

# Add nodes
audit_workflow.add_node("orchestrator", orchestrator)
audit_workflow.add_node("researcher",   run_researcher)
audit_workflow.add_node("analyst",      run_analyst)
audit_workflow.add_node("writer",       run_writer)
audit_workflow.add_node("reviewer",     run_reviewer)

# Set entry point
audit_workflow.set_entry_point("orchestrator")

# Orchestrator routes to any agent
audit_workflow.add_conditional_edges(
    "orchestrator",
    route_after_orchestrator,
    {
        "researcher":   "researcher",
        "analyst":      "analyst",
        "writer":       "writer",
        "reviewer":     "reviewer",
        "human_review": "human_review",  # HITL interrupt node, defined in §5.5
        "end":          END,
    }
)

# All agents return to orchestrator
for agent in ["researcher", "analyst", "writer", "reviewer"]:
    audit_workflow.add_edge(agent, "orchestrator")

# Compile with checkpointing
memory = MemorySaver()
audit_app = audit_workflow.compile(checkpointer=memory)

5.3 CrewAI: Role-Based Multi-Agent

CrewAI uses a higher-level abstraction: you define Agents (roles) and Tasks (jobs), and the framework handles coordination.

from crewai import Agent, Task, Crew, Process
from langchain_anthropic import ChatAnthropic

llm = ChatAnthropic(model="claude-sonnet-4-20250514")

# ─── Define Agents (Roles) ────────────────────────────────
researcher = Agent(
    role="Financial Data Researcher",
    goal="Gather complete and accurate financial data for the specified period",
    backstory="Expert CPA with 15 years experience in financial data extraction. "
              "Known for thoroughness and data quality.",
    llm=llm,
    tools=[query_database, fetch_schema],
    verbose=True,
    max_iter=5,          # guardrail: max iterations per task
    max_rpm=10,          # guardrail: max requests per minute
)

analyst = Agent(
    role="Financial Analyst",
    goal="Identify meaningful patterns, anomalies, and insights from financial data",
    backstory="Former Goldman Sachs analyst specializing in forensic accounting. "
              "Has sharp eyes for irregularities.",
    llm=llm,
    tools=[calculate, generate_chart],
    verbose=True,
)

writer = Agent(
    role="Financial Report Writer",
    goal="Transform complex financial analysis into clear, actionable board-level reports",
    backstory="Harvard MBA with communications background. "
              "Specializes in making numbers tell compelling stories.",
    llm=llm,
    tools=[generate_report],
    verbose=True,
)

# ─── Define Tasks ─────────────────────────────────────────
research_task = Task(
    description="""Query the financial database and gather:
    1. Q1-Q4 revenue broken down by segment
    2. COGS and gross margin for all quarters
    3. Operating expenses by department
    4. Year-over-year comparison data
    
    Important: Always use LIMIT 50 in queries. Flag any data gaps.""",
    expected_output="Structured JSON with all financial data, sources, and data quality notes",
    agent=researcher,
)

analysis_task = Task(
    description="""Analyze the research data and:
    1. Calculate QoQ and YoY growth rates
    2. Compute margin trends
    3. Flag ALL variances > 10% as anomalies
    4. Provide confidence score (1-5) for each finding
    
    Use provided data only. No assumptions.""",
    expected_output="Analysis report with findings, anomalies list, and confidence scores",
    agent=analyst,
    context=[research_task],   # Depends on research_task output
)

write_task = Task(
    description="""Write the Q4 Financial Audit Report:
    - Executive Summary (3 bullets max)
    - Key Financial Metrics (table format)
    - Anomalies & Risk Flags (with severity: High/Medium/Low)
    - Recommendations (actionable, specific)
    
    Tone: professional, concise, board-ready.""",
    expected_output="Complete financial report in Markdown, ready for board review",
    agent=writer,
    context=[research_task, analysis_task],
)

# ─── Assemble the Crew ────────────────────────────────────
audit_crew = Crew(
    agents=[researcher, analyst, writer],
    tasks=[research_task, analysis_task, write_task],
    process=Process.sequential,   # or Process.hierarchical
    verbose=True,
    memory=True,                  # Enable shared crew memory
    max_rpm=30,                   # Global rate limit
)

# ─── Run ──────────────────────────────────────────────────
result = audit_crew.kickoff(inputs={
    "quarter": "Q4",
    "year": "2024",
    "company": "Acme Corp"
})
print(result)

5.4 AutoGen: Conversation-Driven Agents

AutoGen models multi-agent collaboration as a conversation between agents. Agents talk to each other until the task is done.

import autogen

# Configuration
config_list = [{"model": "claude-sonnet-4-20250514", "api_key": "..."}]

# ─── Define Agents ────────────────────────────────────────
orchestrator = autogen.AssistantAgent(
    name="Orchestrator",
    system_message="""You coordinate the financial audit team.
    Route tasks to the appropriate specialist.
    End the conversation with TERMINATE when audit is approved.""",
    llm_config={"config_list": config_list},
)

researcher = autogen.AssistantAgent(
    name="Researcher",
    system_message="You gather financial data. Always verify data quality.",
    llm_config={"config_list": config_list},
)

analyst = autogen.AssistantAgent(
    name="Analyst",
    system_message="You analyze financial data and identify anomalies.",
    llm_config={"config_list": config_list},
)

# Human proxy — the user's representative in the conversation
user_proxy = autogen.UserProxyAgent(
    name="HumanReviewer",
    human_input_mode="TERMINATE",   # Only ask human when TERMINATE received
    max_consecutive_auto_reply=10,  # Guardrail: max auto replies
    code_execution_config=False,
)

# ─── Create Group Chat ────────────────────────────────────
groupchat = autogen.GroupChat(
    agents=[user_proxy, orchestrator, researcher, analyst],
    messages=[],
    max_round=20,     # Guardrail: max conversation rounds
    speaker_selection_method="round_robin",  # or "auto" for LLM-based routing
)

manager = autogen.GroupChatManager(
    groupchat=groupchat,
    llm_config={"config_list": config_list},
)

# ─── Initiate conversation ────────────────────────────────
user_proxy.initiate_chat(
    manager,
    message="Please conduct a complete Q4 2024 financial audit for Acme Corp."
)

🔧 Engineer’s Note: Choose your framework based on your need for control vs. speed:

  • LangGraph: When you need full auditability, custom routing logic, and production-grade reliability. More code, more control.
  • CrewAI: When you want agents that “feel like team members” and need to prototype quickly. Good for role-heavy workflows.
  • AutoGen: When the coordination logic should emerge from agent conversation rather than be explicitly coded. Best for research and exploration, less suited for deterministic production workflows.

5.5 The HITL Interrupt: How Escalation Actually Works in Code

The orchestrator function in §5.2 routes to "human_review" when revision_count >= 3. But what does that node look like? In LangGraph, human-in-the-loop works via checkpoints + interrupts: the graph pauses execution, persists state, and resumes only after a human provides input.

# Adding HITL interrupt to the audit workflow
from langgraph.types import interrupt, Command

# ─── The Human Review Node ────────────────────────────────
def human_review_node(state: AuditTeamState) -> Command:
    """
    Pause execution and wait for human input.
    LangGraph serializes state to the checkpointer (DB).
    The graph resumes when the human calls app.invoke() again
    with their decision injected into the state.
    """
    # interrupt() pauses the graph here and surfaces the
    # payload to the caller. Execution stops until resumed.
    decision = interrupt({
        "type":         "human_review_required",
        "reason":       f"Max revisions ({state['revision_count']}) reached.",
        "draft_report": state.get("draft_report", ""),
        "review_notes": state.get("review_notes", []),
        "instructions": "Reply with: APPROVE or REJECT: <reason>",
    })
    
    # After human resumes the graph, `decision` contains their input
    if decision.upper().startswith("APPROVE"):
        return Command(
            goto="orchestrator",
            update={"approved": True, "final_report": state["draft_report"]},
        )
    else:
        # Human rejected — extract reason and reset for rewrite
        reason = decision.replace("REJECT:", "").strip()
        return Command(
            goto="orchestrator",
            update={
                "review_notes":  state.get("review_notes", []) + [f"Human: {reason}"],
                "draft_report":  None,       # force full rewrite
                "revision_count": 0,         # reset counter after human feedback
            },
        )

# ─── Register the node ────────────────────────────────────
audit_workflow.add_node("human_review", human_review_node)

# human_review is already in the routing map from §5.2:
# route_after_orchestrator returns "human_review", which runs this node

# ─── How the caller resumes after interrupt ───────────────
config = {"configurable": {"thread_id": "audit-2024-12"}}

# Initial run — will pause at human_review if triggered
for event in audit_app.stream(initial_state, config):
    if "__interrupt__" in event:
        # Surface the interrupt payload to your UI
        payload = event["__interrupt__"][0].value
        print(f"⏸ Human review needed: {payload['reason']}")
        print(f"Draft:\n{payload['draft_report'][:500]}...")
        break

# ... (human reads the draft, types their decision) ...
human_decision = "APPROVE"  # or "REJECT: Missing Q2 breakdown"

# Resume the graph with human input injected
for event in audit_app.stream(
    Command(resume=human_decision),  # inject decision into interrupted node
    config,
):
    print(event)

Execution Timeline:

  ┌──────────────┐   ┌──────────────┐   ┌───────────────────┐
  │ orchestrator │──→│ human_review │──→│  interrupt()      │
  └──────────────┘   └──────────────┘   │  State saved ✅   │
                                        │  Graph paused ⏸  │
                                        │  Waiting...       │
                                        └────────┬──────────┘

                          Human types: "APPROVE" │

                                        ┌───────────────────┐
                                        │  Graph resumes ▶  │
                                        │  approved = True  │
                                        │  → orchestrator   │
                                        │  → END            │
                                        └───────────────────┘

  Key: State is persisted by MemorySaver (or PostgresSaver in
  production) between pause and resume. The graph can be paused
  for hours/days without losing context.

🔧 Engineer’s Note: interrupt() is fundamentally different from just routing to a human_review node that returns a hardcoded value. interrupt() actually suspends the Python process and persists state to disk/DB. This means your web server can handle 1,000 other requests while 50 audits are awaiting human review — each paused at their own interrupt() checkpoint, waiting to be resumed when the human responds. For production, swap MemorySaver for PostgresSaver so state survives server restarts.


6. The Cost Explosion Problem

This is the most under-discussed challenge in multi-agent systems. Costs don’t scale linearly — they scale super-linearly.

6.1 Why Multi-Agent Costs Multiply

Token Cost Analysis:

  SINGLE AGENT (10-step task):
  ─────────────────────────────────────────────────────
  System prompt:   ~1,000 tokens
  Per step (avg):  ~800 tokens (history grows each step)
  10 steps total:  ~9,000 tokens input + ~2,000 tokens output
  → Total: ~11,000 tokens
  → Cost at $3/MTok: ~$0.033

  3-AGENT TEAM (same task, split across agents):
  ─────────────────────────────────────────────────────
  Each agent has its own system prompt:   3 × 1,000 = 3,000 tokens
  Each agent processes its task:          3 × 3,000 = 9,000 tokens
  Inter-agent messages (handoffs):        ~5,000 tokens
  Orchestrator overhead:                  ~8,000 tokens
  → Total: ~25,000 tokens
  → Cost at $3/MTok: ~$0.075

  PLUS:
  Reviewer agent (review loop ×2):        +10,000 tokens
  → Total: ~35,000 tokens
  → Cost at $3/MTok: ~$0.105

  RESULT: splitting the same work across 3 agents costs ~3.2× as
  much, not the ~1× you might expect for the same task.
  The super-linear cost comes from: orchestration overhead +
  inter-agent message copying + review cycles.

Cost Growth by Agent Count:

  Agents:    1     2     3     5      10
  Tokens:    11K   20K   35K   85K    300K
  Cost:      $0.03 $0.06 $0.11 $0.26  $0.90
  
  Multiplier vs. 1 agent:
             1×    1.8×  3.2×  7.7×   27×

  → Cost grows faster than linearly with agent count
  → 10 agents costs 27× more than 1 agent for comparable work

6.2 Budget Control Strategies

# Multi-agent cost control patterns
from dataclasses import dataclass
from typing import Dict

class BudgetExceededError(Exception):
    """Raised when a call would push the team over its budget."""

@dataclass
class AgentBudget:
    max_tokens_input:  int = 50_000   # tokens per agent run
    max_tokens_output: int = 5_000
    max_iterations:    int = 10
    max_cost_usd:      float = 2.00

class MultiAgentBudgetManager:
    def __init__(self, team_budget_usd: float = 10.00):
        self.team_budget     = team_budget_usd
        self.spent:          Dict[str, float] = {}
        self.total_spent:    float = 0.0
    
    def check_budget(self, agent_name: str, estimated_cost: float) -> bool:
        """Return True if budget allows this call; raise BudgetExceededError otherwise."""
        if self.total_spent + estimated_cost > self.team_budget:
            raise BudgetExceededError(
                f"Team budget ${self.team_budget} would be exceeded. "
                f"Spent so far: ${self.total_spent:.3f}"
            )
        return True
    
    def record_cost(self, agent_name: str, actual_cost: float):
        self.spent[agent_name] = self.spent.get(agent_name, 0) + actual_cost
        self.total_spent += actual_cost
    
    def report(self):
        print(f"\n💰 Cost Report:")
        for agent, cost in self.spent.items():
            pct = (cost / self.total_spent * 100) if self.total_spent > 0 else 0
            print(f"  {agent}: ${cost:.3f} ({pct:.1f}%)")
        print(f"  TOTAL:    ${self.total_spent:.3f} / ${self.team_budget:.2f}")

6.3 Cost Optimization Strategies

| Strategy | How | Savings |
|---|---|---|
| Route cheap tasks to smaller models | Researcher uses Claude Haiku, Analyst uses Claude Sonnet | 60-80% on routine tasks |
| Summarize inter-agent context | Don't pass full data — pass summaries | 40-60% on handoff tokens |
| Limit review cycles | Max 2 revisions, then escalate to human | Prevents runaway review loops |
| Cache intermediate results | Store researcher output, reuse for re-runs | 100% savings on re-runs |
| Parallelize where possible | Run research + web search simultaneously | No token savings, but ~50% time savings |

🔧 Engineer’s Note: Multi-agent systems should be built with an explicit cost model before writing a single line of code. Estimate: N agents × average turns × average tokens per turn = expected cost. Add 2× buffer for review cycles and orchestration overhead. If the expected cost is too high, redesign — reduce agents, smaller models for sub-tasks, or summarize aggressively. An unbudgeted multi-agent system in production is a budget incident waiting to happen.
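
The back-of-envelope formula in the note above is easy to encode in the design doc itself. A minimal sketch — the function name, defaults, and pricing are illustrative assumptions, not part of any framework:

```python
def estimate_team_cost(
    n_agents: int,
    avg_turns: int,
    tokens_per_turn: int,
    usd_per_mtok: float = 3.0,   # assumed Sonnet-class input pricing
    buffer: float = 2.0,         # 2× buffer for review cycles + orchestration
) -> float:
    """Pre-build estimate: N agents × avg turns × tokens per turn, with buffer."""
    expected_tokens = n_agents * avg_turns * tokens_per_turn
    return expected_tokens * usd_per_mtok / 1_000_000 * buffer

# 3 agents, ~4 turns each, ~3,000 tokens per turn
print(f"${estimate_team_cost(3, 4, 3_000):.2f}")  # → $0.22
```

If that number multiplied by your expected daily run count makes you wince, redesign before writing the agents.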

6.4 Observability: LangSmith vs. Phoenix vs. Datadog

You can’t optimize what you can’t see. Multi-agent systems require trace-level visibility into every agent call, tool invocation, and inter-agent message. Here’s how the three major options compare:

| Feature | LangSmith | Phoenix (Arize) | Datadog |
|---|---|---|---|
| Focus | LangChain/LangGraph native | Open-source LLM observability | General APM + LLM add-on |
| Setup | 2 env vars (LANGCHAIN_API_KEY + LANGCHAIN_TRACING_V2=true) | pip install arize-phoenix, run server | Datadog Agent + LLM integration |
| Agent traces | ✅ Native — full graph execution tree | ✅ OpenTelemetry-based spans | ⚠️ Requires manual instrumentation |
| TTFT | ✅ Time-to-first-token per span | ✅ Streaming latency metrics | ✅ Custom metrics |
| Inter-agent flow | ✅ Parent/child span tree per agent | ✅ Trace waterfall | ⚠️ Manual context propagation |
| Judge-LLM integration | ✅ LangSmith Evaluators (built-in) | ✅ Evals framework | ❌ Custom build required |
| Cost tracking | ✅ Per-run token + cost breakdown | ✅ Token cost dashboards | ⚠️ Custom metrics |
| Self-hosted | ❌ SaaS only (free tier available) | ✅ Fully self-hostable | ✅ Self-hosted option |
| Best for | LangGraph projects | Privacy-sensitive / on-prem | Teams already using Datadog |

TTFT (Time to First Token) is the most important latency metric in streaming agent systems — it measures how long the user waits before seeing the first response token. In a multi-agent pipeline, TTFT compounds across agents:

TTFT in Multi-Agent Context:

  User submits goal

  ├── Orchestrator LLM call:  TTFT₁ = 450ms   first plan token arrives
  ├── Researcher tool call:   TTFT₂ = 0ms     (DB query, not LLM)
  ├── Analyst LLM call:       TTFT₃ = 380ms   first analysis token
  ├── Writer LLM call:        TTFT₄ = 520ms   first report token
  └── User sees result:       Total wall time = ~8.5s

  Key insight: TTFT measures LLM responsiveness per agent call.
  Total latency = sum of all agent calls + tool execution time.
  You need per-agent TTFT to identify which agent is the bottleneck.
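
Once per-agent spans exist, finding the bottleneck is trivial. A hypothetical sketch over span dicts shaped like the timeline above (the field names are assumptions for illustration):

```python
def find_ttft_bottleneck(spans: list[dict]) -> dict:
    """Return the LLM span with the highest time-to-first-token."""
    llm_spans = [s for s in spans if s["ttft_ms"] > 0]  # skip pure tool calls
    return max(llm_spans, key=lambda s: s["ttft_ms"])

spans = [
    {"agent": "Orchestrator", "ttft_ms": 450},
    {"agent": "Researcher",   "ttft_ms": 0},    # DB query, no LLM call
    {"agent": "Analyst",      "ttft_ms": 380},
    {"agent": "Writer",       "ttft_ms": 520},
]
print(find_ttft_bottleneck(spans)["agent"])  # → Writer
```
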

# Phoenix (Arize) setup — self-hostable, open-source
import phoenix as px
from openinference.instrumentation.langchain import LangChainInstrumentor

# Start Phoenix server (local or remote)
px.launch_app()   # opens at http://localhost:6006

# Auto-instrument all LangChain/LangGraph calls
LangChainInstrumentor().instrument()

# Every agent call now emits spans with:
# - agent name (from node name in LangGraph)
# - TTFT (time_to_first_token attribute)
# - total tokens (input + output)
# - latency (elapsed_ms)
# - tool calls made within the span
# - cost estimate

# ─── LangSmith setup — 2 lines ───────────────────────────
import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"]    = "lsv2_..."
# Now every LangGraph run is traced automatically.
# View at: https://smith.langchain.com

# ─── Custom alerting on TTFT + cost anomalies ────────────
from dataclasses import dataclass

@dataclass
class AgentSLOs:
    max_ttft_ms:   int   = 1000   # alert if first token > 1s
    max_total_ms:  int   = 30000  # alert if total agent call > 30s
    max_cost_usd:  float = 0.50   # alert if single agent call > $0.50

def check_agent_slos(span_data: dict, slos: AgentSLOs):
    alerts = []
    if span_data["ttft_ms"] > slos.max_ttft_ms:
        alerts.append(f"⚠️ TTFT exceeded: {span_data['ttft_ms']}ms > {slos.max_ttft_ms}ms")
    if span_data["total_ms"] > slos.max_total_ms:
        alerts.append(f"⚠️ Latency exceeded: {span_data['total_ms']}ms")
    if span_data["cost_usd"] > slos.max_cost_usd:
        alerts.append(f"💸 Cost spike: ${span_data['cost_usd']:.3f} for one call")
    return alerts

🔧 Engineer’s Note: Start with LangSmith if you use LangGraph — it’s zero-config. Set two environment variables and every agent call, tool invocation, TTFT measurement, and cost is traced automatically. Move to Phoenix when you need self-hosting (financial data that can’t touch third-party SaaS) or when you want to run your Judge-LLM evaluations in the same observability platform. Datadog is the right choice if your company already runs Datadog for infrastructure — adding LLM-specific metrics to an existing Datadog setup is far easier than introducing a new observability stack.


7. Agent Evaluation: The Judge-LLM Pattern

The hardest question in multi-agent systems: how do you know the system actually did a good job?

7.1 The Evaluation Problem

Why Multi-Agent Output Is Hard to Evaluate:

  Single-tool output:
  "SELECT COUNT(*) FROM customers"
  → Returns: 2,847
  → Evaluation: easy (compare to ground truth)

  Multi-agent audit report:
  → Accuracy: Are all numbers correct?
  → Completeness: Did it miss any anomalies?
  → Faithfulness: Are conclusions based on actual data?
  → Clarity: Is the report understandable to a non-technical reader?
  → Actionability: Are recommendations specific and implementable?
  
  → Human evaluation: accurate but slow and expensive
  → Rule-based evaluation: fast but brittle (can't handle nuance)
  → Judge-LLM evaluation: best compromise for production

7.2 The Judge-LLM Pattern

A Judge-LLM is a separate LLM call whose only job is to evaluate another LLM’s output:

from langchain_anthropic import ChatAnthropic
from pydantic import BaseModel, Field

class AuditReportEvaluation(BaseModel):
    """Structured evaluation of an audit report."""
    correctness:   int = Field(ge=1, le=5, description="Are all numbers accurate?")
    completeness:  int = Field(ge=1, le=5, description="Were all anomalies identified?")
    faithfulness:  int = Field(ge=1, le=5, description="Are conclusions based on provided data only?")
    clarity:       int = Field(ge=1, le=5, description="Is the report clear and professional?")
    actionability: int = Field(ge=1, le=5, description="Are recommendations specific and actionable?")
    
    issues:      list[str] = Field(description="Specific problems found")
    overall:     int       = Field(ge=1, le=5, description="Overall quality score")
    verdict:     str       = Field(description="PASS (≥4.0 avg) or FAIL with explanation")
    
    @property
    def average_score(self) -> float:
        return (self.correctness + self.completeness + self.faithfulness +
                self.clarity + self.actionability) / 5

def judge_audit_report(
    report: str,
    source_data: str,
    expected_anomalies: list[str]
) -> AuditReportEvaluation:
    """Use a powerful LLM to evaluate the audit report quality."""
    
    judge_llm = ChatAnthropic(
        model="claude-opus-4-20250514"  # Use LARGER model as judge
    ).with_structured_output(AuditReportEvaluation)
    
    judge_prompt = f"""You are a senior audit partner reviewing an AI-generated financial report.
    
Evaluate this report against the source data and known anomalies.

SOURCE DATA (ground truth):
{source_data[:3000]}

KNOWN ANOMALIES TO DETECT:
{chr(10).join(f"- {a}" for a in expected_anomalies)}

REPORT TO EVALUATE:
{report[:4000]}

Score each dimension 1-5:
- 5: Excellent, no issues
- 4: Good, minor issues  
- 3: Acceptable, notable gaps
- 2: Poor, significant issues
- 1: Unacceptable, critical failures

Be rigorous. The report will be shown to the board."""
    
    return judge_llm.invoke(judge_prompt)

# Usage in the pipeline:
evaluation = judge_audit_report(
    report=state["final_report"],
    source_data=state["raw_data"],
    expected_anomalies=["Q3 COGS spike 23%", "Q2 margin compression"]
)

if evaluation.average_score < 4.0:
    print(f"❌ Report FAILED (score: {evaluation.average_score:.1f}/5)")
    print(f"Issues: {evaluation.issues}")
    # Send back for revision
else:
    print(f"✅ Report PASSED (score: {evaluation.average_score:.1f}/5)")
    # Proceed to HITL approval

7.3 Judge-LLM Best Practices

Judge-LLM Design Rules:

  1. LARGER MODEL AS JUDGE
     Judge model should be equal or larger than the agent model.
     Don't use Claude Haiku to judge Claude Sonnet's work.
     Judge: Claude Opus / GPT-4o
     Agent: Claude Sonnet / GPT-4o mini

  2. STRUCTURED OUTPUT
     Use Pydantic models to force the judge to score each dimension.
     Unstructured "this is good/bad" is hard to act on programmatically.

  3. REFERENCE DATA
     Always include: ground truth data, expected outputs, known edge cases.
     A judge without reference data is just an opinion machine.

  4. CALIBRATE WITH HUMAN SPOT-CHECKS
     Run judge evaluations in parallel with human reviews for 2-4 weeks.
     Measure judge-human agreement rate (target: >80% agreement).
     Adjust judge prompt if agreement is low.

  5. FAIL FAST IN PIPELINES
     Use judge as a gate, not a final report.
     If score < threshold → retry → if still fails → escalate to human.
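
Rule 5 as a control-flow sketch. `generate` and `judge` are stand-in callables (not real framework APIs); in the audit pipeline they would wrap the writer agent and `judge_audit_report` from §7.2:

```python
def evaluation_gate(generate, judge, threshold=4.0, max_retries=2):
    """Judge as a gate: retry on failure, escalate to a human after max_retries."""
    feedback = None
    for _ in range(max_retries + 1):
        report = generate(feedback)          # regenerate, using judge feedback if any
        score, feedback = judge(report)      # score 1-5 plus issues found
        if score >= threshold:
            return report, "passed"
    return report, "escalate_to_human"

# Stub agents: first draft scores 3.5, the revision scores 4.2
scores = {"draft v1": 3.5, "draft v2": 4.2}
def generate(feedback):
    return "draft v2" if feedback else "draft v1"
def judge(report):
    return scores[report], "tighten the anomalies section"

print(evaluation_gate(generate, judge))  # → ('draft v2', 'passed')
```
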

🔧 Engineer’s Note: Judge-LLM is not perfect, but it’s 100× better than no evaluation. The alternative — running multi-agent systems in production with no quality checks — is how you ship wrong numbers to your board. Judge-LLM catches the “obvious” failures (missing data, mathematical errors, hallucinated recommendations) automatically. Human reviewers can then focus on the ambiguous cases the judge flags with low confidence.


8. Challenges & Anti-Patterns

Multi-agent systems fail in predictable ways. Learn to recognize these patterns before they cost you.

8.1 The Politeness Loop (Agent Communication Deadlock)

The most insidious multi-agent failure: agents are excessively deferential to each other, leading to an infinite loop of non-action.

The Politeness Loop:

  Writer:   "Here is the draft report. Please review."
  Reviewer: "Thank you! The draft looks good overall. Any specific areas
             you'd like me to focus on?"
  Writer:   "Thank you for asking! Please focus on whatever you think
             is most important."
  Reviewer: "Of course! Is there any section you're less confident about?"
  Writer:   "That's a great question. Maybe you could review the anomalies
             section? But really, any feedback is welcome."
  Reviewer: "Absolutely! I'll take a comprehensive look. Anything else to
             consider before I begin?"
  Writer:   "Not really! Please review at your discretion."
  ...
  [Conversation ends after max_rounds limit is hit. Nothing was reviewed.]

Prevention:

# Force directional conversation with explicit task requirements
writer_handoff = Task(
    description="""Review the attached draft. Your response MUST contain:
    1. APPROVED (if all five quality criteria are met), OR
    2. REVISION REQUIRED: [specific list of changes needed]
    
    Do NOT ask clarifying questions. Do NOT express uncertainty.
    Make a definitive judgment based on the provided criteria.""",
    expected_output="Either 'APPROVED' or 'REVISION REQUIRED: [numbered list]'",
    agent=reviewer,
)
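
The same contract can be enforced programmatically: reject any reviewer message that is neither of the two allowed verdicts, instead of letting the conversation drift. A minimal sketch (function name and protocol strings are illustrative, matching the Task above):

```python
import re

def parse_review(text: str):
    """Accept only 'APPROVED' or 'REVISION REQUIRED: ...'; anything else is a protocol violation."""
    text = text.strip()
    if text.upper() == "APPROVED":
        return ("approved", [])
    m = re.match(r"REVISION REQUIRED:\s*(.+)", text, re.IGNORECASE | re.DOTALL)
    if m:
        changes = [line.strip() for line in m.group(1).splitlines() if line.strip()]
        return ("revise", changes)
    raise ValueError(f"Reviewer broke protocol (politeness loop?): {text[:80]!r}")

print(parse_review("APPROVED"))                      # → ('approved', [])
print(parse_review("REVISION REQUIRED: 1. Fix Q3"))  # → ('revise', ['1. Fix Q3'])
```

On a `ValueError`, re-prompt the reviewer once with the protocol restated; if it violates the format again, escalate rather than looping.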

8.2 The Echo Chamber

Multiple agents agree with each other without critical evaluation, leading to reinforced (possibly wrong) conclusions.

Echo Chamber Pattern:

  Researcher: "Revenue grew 7.2% in Q4."
  Analyst:    "Confirmed — Q4 revenue growth of 7.2% is significant."
  Writer:     "Strong Q4 performance with 7.2% revenue growth."
  Reviewer:   "Report accurately reflects Q4's 7.2% revenue growth."
  
  [No agent noticed that 7.2% growth on a declining baseline
   actually means the business is still shrinking vs. 2 years ago.]

Prevention:
  → Give each agent different "lenses" in their system prompts:
    Analyst: "Be skeptical. Find what's wrong or missing."
    Reviewer: "Play devil's advocate. What would critics say?"
  → Add a dedicated "steelman and devil's advocate" agent role.

8.3 Context Amnesia in Long Pipelines

By the time information reaches the 4th or 5th agent in a pipeline, critical context from earlier agents may be lost or distorted.

Context Loss Across Agents:

  Step 1 (Researcher): "Q3 COGS increased 23% due to a single large
                        one-time purchase of raw materials on 2024-09-15."
  [Researcher passes a summary, not full detail]
  
  Step 2 (Analyst):    "Q3 COGS anomaly detected: +23% variance."
  [Loses the "one-time" and "raw materials" context]
  
  Step 3 (Writer):     "Q3 shows a concerning cost trend with COGS up 23%."
  [Now it sounds like a recurring problem, not a one-time event]
  
  Step 4 (Reviewer):   "Report correctly identifies the Q3 cost increase."
  [Approves the now-misleading characterization]
  
  Board:               "Why are our costs trending up?"
  CFO:                 "They're not — it was a one-time purchase!"

Prevention:
  → Include full anomaly details in every handoff, not just numbers
  → Use structured handoff templates that preserve context
  → Run a final "consistency check" agent that compares
    final report to original source data
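
A structured handoff template is just "the qualifiers must travel with the number". A minimal sketch of what such a record might look like (the field names are illustrative, not from any framework):

```python
from dataclasses import dataclass, field

@dataclass
class AnomalyHandoff:
    """Handoff record that forces context to travel with the number."""
    metric: str             # e.g. "Q3 COGS"
    delta: str              # e.g. "+23%"
    cause: str              # the detail that got lost in the example above
    one_time: bool          # recurring trend vs. one-time event
    source_refs: list[str] = field(default_factory=list)

    def to_handoff_text(self) -> str:
        recurrence = "ONE-TIME event" if self.one_time else "recurring trend"
        return f"{self.metric} {self.delta} ({recurrence}); cause: {self.cause}"

h = AnomalyHandoff("Q3 COGS", "+23%",
                   "single raw-materials purchase on 2024-09-15", one_time=True)
print(h.to_handoff_text())
# → Q3 COGS +23% (ONE-TIME event); cause: single raw-materials purchase on 2024-09-15
```

If every agent in the pipeline passes this record instead of a free-text summary, the Writer cannot quietly drop "one-time" on the way to the board.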

8.4 The Star Agent Problem

One agent becomes a bottleneck — everything routes through it, creating high load, high cost, and a single point of failure.

Star Pattern (anti-pattern):

  All agents → Orchestrator → All agents
  
  Orchestrator must:
  - Route every inter-agent message
  - Summarize and re-contextualize between agents
  - Handle all errors and re-routing
  → Becomes a bottleneck as agent count grows
  → Single LLM with enormous context window
  → Very expensive and slow

Better: Mesh with direct agent-to-agent handoffs for sequential steps:
  Researcher → Analyst → Writer → Reviewer
  Orchestrator only handles routing exceptions and HITL decisions.

8.5 Anti-Pattern Summary

| Anti-Pattern | Symptom | Fix |
|---|---|---|
| Politeness Loop | Agents ask questions instead of completing tasks | Force structured outputs, add "Do NOT ask" to prompts |
| Echo Chamber | Agents reinforce each other's errors | Give each agent a skeptical lens, add devil's advocate role |
| Context Amnesia | Later agents miss critical early context | Use structured handoffs, compare final vs. source data |
| Star Bottleneck | Orchestrator handles everything | Direct agent handoffs for sequential steps |
| Uncapped Cost | Token bill explodes | Budget manager, per-agent limits, summarize handoffs |
| Silent Failure | Agent "completes" task with wrong output | Judge-LLM evaluation gate, structured output schemas |

🔧 Engineer’s Note: The root cause of most multi-agent failures is treating agents as reliable, rational employees. They’re not — they’re probabilistic systems with limited context that can be confused by ambiguity, misled by previous context, and exhausted by long conversation histories. Design your system to be robust to agent errors, not optimistic that errors won’t happen. Every handoff should include failure handling. Every review cycle should have a maximum revision count with a human escalation path.


9. Key Takeaways

9.1 When to Use Multi-Agent

Multi-Agent Decision Framework:

  Does the task require 10+ distinct steps?           → Maybe
  Are there clearly distinct expert roles needed?     → Maybe
  Can steps run in parallel to save time?             → Maybe
  Is quality review of AI output critical (finance)?  → Maybe
  
  ALL of the above are true?     → Multi-agent is justified
  SOME of the above are true?    → Start with single agent + more tools
  NONE of the above are true?    → Definitely single agent
  
  Budget is > 5× expected single-agent cost?
  → Only proceed if the quality/speed gain justifies it
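
The framework above can live in a design-review checklist as code. A hypothetical sketch (names are illustrative):

```python
def architecture_recommendation(
    ten_plus_steps: bool,
    distinct_expert_roles: bool,
    parallel_speedup: bool,
    review_critical: bool,
) -> str:
    """Encode the decision framework: all criteria → multi-agent, some → single agent + tools."""
    criteria = [ten_plus_steps, distinct_expert_roles, parallel_speedup, review_critical]
    if all(criteria):
        return "multi-agent justified"
    if any(criteria):
        return "single agent + more tools"
    return "single agent"

print(architecture_recommendation(True, True, True, True))    # → multi-agent justified
print(architecture_recommendation(True, False, False, False)) # → single agent + more tools
```
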

9.2 Summary Table

| Topic | Key Principle |
|---|---|
| Supervisor pattern | Best for most production workflows — clear routing, explicit control |
| Blackboard (LangGraph) | State flows through nodes — each agent reads previous results |
| A2A vs. MCP | MCP: agent-to-tools (vertical), A2A: agent-to-agent (horizontal) |
| Cost explosion | 3 agents ≈ 3-5× the cost of one agent, not 1×. Budget before you build |
| Smaller models for sub-tasks | Route cheap tasks (research, summarization) to smaller, cheaper models |
| Judge-LLM | Evaluate outputs automatically. Use a judge model at least as large as the agent model |
| Politeness loop | Force structured outputs. "Do NOT ask clarifying questions." |
| Context amnesia | Include full context in handoffs. Compare final report to source data |
| Max review cycles | 2-3 revisions max, then human escalation. Never unlimited |

9.3 The Architecture Evolution

AI 01  → Prompt Engineering (how to talk to LLMs)
AI 02  → Dev Tools (LLMs that write code)
AI 03  → RAG (LLMs that read private data)
AI 04  → MCP (universal tool connections)
AI 05  → Agents (LLMs that act autonomously)
AI 06  → Multi-Agent (agent teams that collaborate) ← NOW
         ↑ specialization + parallelization + cross-system coordination
AI 07  → Security (making all of the above safe)
AI 08  → Financial Application (putting it all together)

9.4 Bridge to AI 07

You now know how to build systems where multiple AI agents collaborate, each with their own tools, memory, and expertise. These systems are powerful — and that power creates risk.

What happens when a malicious user sends a prompt that the Researcher agent passes to the Analyst agent, which passes it to the Writer agent, which then includes it in a report that gets sent to the board? That’s an indirect prompt injection attack — and it’s trivially easy in a multi-agent pipeline.

In AI 07, we put guardrails on everything we’ve built: prompt injection defenses, output sanitization, data privacy controls, red teaming, and the Defense-in-Depth architecture that makes AI systems production-safe.


This is AI 06 of a 12-part series on production AI engineering. Continue to AI 07: AI Security — Defending the Probabilistic Attack Surface.