TUNDRA // NEXUS


Multi-Agent Workflows Often Fail. Here's How to Engineer Ones That Don't.

🔗 github.blog
February 24, 2026
SIGNAL 8/10
#ai #dev

🟢 READ | ā± 4 min | šŸ“” 8/10 | šŸŽÆ Engineers building multi-agent systems, AI platform architects

TL;DR

Multi-agent systems fail not because of model capability but missing structure. GitHub's three patterns: (1) typed schemas at every agent boundary so invalid messages fail fast instead of propagating bad state; (2) constrained action schemas — agents must return exactly one valid action from an explicit set, not invent their own; (3) MCP as the enforcement layer that validates calls before execution. The mental model shift: agents are distributed system components, not smart chatbots.
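The first pattern can be sketched as a typed check at the agent boundary. A minimal illustration in plain TypeScript (the post itself uses zod schemas; the message shape and names here are hypothetical):

```typescript
// Hypothetical message shape for an agent-to-agent handoff.
type AgentMessage = { taskId: string; payload: string };

// Validate at the boundary: reject malformed input immediately
// instead of letting bad state propagate to downstream agents.
function parseMessage(raw: unknown): AgentMessage {
  const m = raw as Partial<AgentMessage>;
  if (typeof m?.taskId !== "string" || typeof m?.payload !== "string") {
    throw new Error("invalid agent message"); // fail fast
  }
  return { taskId: m.taskId, payload: m.payload };
}
```

The point is where the check lives, not how it's written: every boundary gets an explicit contract, so an invalid message surfaces at the handoff rather than three agents later.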

Signal

  • The root failure pattern is agents making implicit assumptions about state, ordering, and validation — exactly like distributed systems race conditions. The fix is the same: explicit contracts
  • Action schema example is immediately usable: z.discriminatedUnion("type", [...]) where invalid returns trigger retry/escalation, not silent failure
  • MCP adds input/output schema validation before execution — this prevents agents from inventing fields or drifting across calls, which is the #1 silent failure mode
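The constrained-action idea from the second bullet, sketched as a tagged union. The post's zod version (z.discriminatedUnion("type", [...])) expresses the same thing declaratively; the action names here are illustrative, not from the source:

```typescript
// The agent may return exactly one of these actions and nothing else.
type Action =
  | { type: "search"; query: string }
  | { type: "respond"; text: string };

// Returns null on anything outside the allowed set, so the caller
// can retry or escalate instead of silently executing a made-up action.
function validateAction(raw: any): Action | null {
  if (raw?.type === "search" && typeof raw.query === "string") return raw;
  if (raw?.type === "respond" && typeof raw.text === "string") return raw;
  return null;
}
```

A null result is the retry/escalation trigger the bullet describes: the invalid return is caught at validation time, which is the same role MCP's schema checks play before a tool call executes.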

What They're NOT Telling You

This is a GitHub blog post, not an academic paper — it's a well-written opinion backed by GitHub's real-world experience, but "we've seen" and "through our work" are not quantified. No failure rate data, no before/after metrics. The MCP recommendation is also convenient given GitHub's own investment in MCP tooling. Good patterns, but not empirically validated at scale in the way a research paper would be.

Trust Check

Factuality ✅ | Author Authority ✅ (GitHub engineering, real production experience) | Actionability ✅