
The routing table in SKILL.md was clear: Gemini Flash for research phases, Gemini Thinking for complex reasoning, Claude Sonnet for implementation. "150x cheaper," read the comment next to the Flash entry. The table had been there for months, referenced in architecture docs, cited in planning notes. We ran a six-phase research mission expecting to see Gemini handle the first two phases. The telemetry showed Claude Sonnet on all six. So I read the Go source.
Every model tier -- think, work, quick -- routed to a Claude model. Opus for architecture, Sonnet for implementation, Haiku for formatting. Gemini appeared exactly once in the codebase: gemini-embedding-001, generating vector embeddings for learning retrieval. The routing table was aspirational documentation. Cross-model orchestration had been designed, documented, and never built. And the 279 missions that ran without it had a 92.1% success rate.
That gap -- between the architecture we described and the architecture that actually ran -- is the same gap most frameworks shipping this week haven't closed.
The HN Wave
On March 5, a Show HN post for Stoneforge scored 49 points in its first day. The project is three weeks old, TypeScript, Apache 2.0, 33 stars. Its README describes a Director Agent that decomposes goals into dependency-ordered tasks, a Dispatch Daemon that polls for idle workers every five seconds, and Ephemeral Workers that execute in isolated git worktrees. VibeHQ shipped a few weeks earlier: a WebSocket hub where human-named agents -- Jordan (Frontend Engineer), Alex (Backend Engineer), Morgan (QA) -- pass messages through a broker that never makes a single LLM call itself.
Both projects solve real problems. Stoneforge's "Passing of the Chisel" is the best solution I've seen to context window degradation: when a worker approaches its limit, it commits code, writes structured handoff notes, preserves metadata, and unassigns. A fresh worker picks up from the handoff context. VibeHQ's idle-aware message queue buffers messages while an agent is generating and flushes them on transition to idle, preventing the interrupt-mid-generation failure that plagues naive multi-agent setups.
But both are built on the same core assumption: that modeling agents as team members improves what gets produced.
The Teammates Assumption and Why It Breaks
VibeHQ's agent definition is six JSON properties: name, role, cli, cwd, permissions, directories. The type system has a capabilities field, but it is never used for routing; all addressing is by human name string. Nine role presets ship with 500-to-800-word system prompts. Every agent receives the same 20 MCP tools regardless of role.
The contracts that PM agents are supposed to enforce are prompt-based, not system-enforced. The Hub tracks task state through seven lifecycle transitions -- created, queued, accepted, rejected, in_progress, blocked, done -- but never evaluates whether an artifact is correct. The Hub routes and delivers. It does not judge. When an agent produces bad output, the PM agent must notice through its own LLM reasoning. Nothing in the Hub code prevents a developer agent from writing code before a contract is signed.
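The distinction is easy to see in code. A sketch of the seven-state lifecycle as a transition table -- the legal-transition edges here are my assumptions from the state names, not VibeHQ's source -- makes clear what a hub like this can and cannot enforce:

```go
package main

import "fmt"

// validNext sketches the seven-state task lifecycle. A hub can enforce that
// transitions are legal; nothing here evaluates whether the artifact behind
// a transition is actually correct. Edge choices are illustrative assumptions.
var validNext = map[string][]string{
	"created":     {"queued"},
	"queued":      {"accepted", "rejected"},
	"accepted":    {"in_progress"},
	"rejected":    {"queued"},
	"in_progress": {"blocked", "done"},
	"blocked":     {"in_progress"},
	"done":        {},
}

func canTransition(from, to string) bool {
	for _, next := range validNext[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition("accepted", "in_progress")) // true: legal edge
	fmt.Println(canTransition("created", "done"))         // false: no shortcut to done
}
```

Everything this code checks is about ordering. Whether `done` means "the code works" is invisible to it.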
This is not a criticism of VibeHQ specifically. It is the structural constraint of the team metaphor applied to software systems: you cannot delegate accountability to a message broker.
The academic evidence is more nuanced than framework READMEs suggest. MetaGPT's ICLR 2024 ablation study is the strongest argument for role specialization: removing roles dropped executability from 4.0 to 1.0 on greenfield code generation, a controlled result with a real mechanism behind it. But the benchmark evaluates programs small enough to fit in a single session. The 2025 scaling study is less favorable: average performance across all multi-agent configurations was -3.5% compared to a single agent, with variance between -70% and +81% depending on the task. Above roughly 45% single-agent accuracy, adding agents hurt more than it helped.
The persona prompting data is worse. A 2024 study tested 162 personas across 9 models and 2,410 factual questions. Adding personas to system prompts produced no improvement compared to a no-persona control. Some personas actively decreased accuracy. Automatic persona selection performed no better than random selection. A "senior security engineer" persona may produce output that looks more like a security review without catching more vulnerabilities.
The team model's genuine value is in comprehension, not performance. MetaGPT's roles may produce their benefit through structured handoffs rather than through the team metaphor itself -- nobody has tested this directly. You could have structured handoffs without human names, role presets, and hub-mediated message passing. CrewAI's 100,000 certified developers and 450 million agents per month are strong evidence the team abstraction is legible enough for rapid adoption. They are not evidence it produces better software.

What We Learned from the Cross-Model Failure
The routing table existed because the cost argument was sound. Gemini Flash is substantially cheaper than Claude Sonnet for research phases -- agents collecting context, summarizing sources, compiling briefs. Across hundreds of missions, the savings would be real. The architecture diagram was clean. The table was documented.
The implementation died at quality validation.
To route a research phase to Gemini Flash, you need to know whether Gemini's output is good enough to hand off to the Claude implementation phase that depends on it. To answer that question, you need a quality gate that understands both models' output formats, failure modes, and capability profiles at the phase level. A gate capable of that judgment is itself a frontier LLM call. The cost of the gate erases the routing arbitrage.
This is not solvable by making the routing table smarter. It is structural. Cross-model orchestration at the phase level requires per-model quality calibration that costs what the arbitrage intended to save. Every framework that documents "route research to Model A, reasoning to Model B" without shipping the quality validation layer is publishing aspirational architecture.
The result of abandoning the cross-model routing: 279 missions on Claude alone, 92.1% success rate, 94.2% phase completion across 1,106 phases. The routing table stayed in SKILL.md as a design intention. The missions ran on the simpler system.
The Phases-with-Personas Alternative
The orchestrator that ran those 279 missions used a Go binary as the coordination layer. Not an LLM, not a PM agent, not a hub. Deterministic scheduling of a dependency graph with a three-worker concurrency semaphore. Personas were markdown prompt templates, selected by a single Haiku call with keyword scoring as fallback.
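The keyword-scoring fallback is small enough to sketch in full. The keyword lists below are illustrative assumptions, not the production catalog -- the point is that persona selection is a lookup, not a conversation:

```go
package main

import (
	"fmt"
	"strings"
)

// personaKeywords sketches the fallback catalog: each persona maps to
// trigger words counted in the phase objective. Lists are illustrative.
var personaKeywords = map[string][]string{
	"backend-engineer":  {"api", "database", "endpoint", "migration"},
	"frontend-engineer": {"ui", "css", "component", "render"},
	"researcher":        {"survey", "compare", "sources", "summarize"},
}

// selectPersona picks the persona with the most keyword hits, defaulting to
// backend-engineer (an assumption) when nothing matches.
func selectPersona(objective string) string {
	obj := strings.ToLower(objective)
	best, bestScore := "backend-engineer", -1
	for persona, words := range personaKeywords {
		score := 0
		for _, w := range words {
			if strings.Contains(obj, w) {
				score++
			}
		}
		if score > bestScore {
			best, bestScore = persona, score
		}
	}
	return best
}

func main() {
	fmt.Println(selectPersona("Survey and compare vector database sources"))
}
```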
Twenty-two personas across 1,106 phases: backend-engineer took 21.6% of phases, researcher 12.5%, frontend-engineer 8.8%, storyteller 8.4%. The persona selection was a routing decision, not a social one. Cross-phase communication was string concatenation:
```go
priorContext := strings.Join(priorParts, "\n\n---\n\n")
```

Phase 2 did not negotiate with Phase 1. It received Phase 1's full text output injected into its CLAUDE.md under "Context from Prior Work." No messages. No contracts. No hub. The prior output became the next phase's briefing document. Each phase spawned a fresh Claude CLI subprocess with a generated context bundle: objective, persona prompt, relevant skills, prior context, and learnings retrieved from previous missions.
The 279-mission history included a six-phase UI polish mission (1,599 seconds), a six-phase learnings injection reform (2,126 seconds), and a maximum-duration mission that ran 15 hours. Average: 27.7 minutes, 4.0 phases, 0.22 retries per mission.

What Actually Matters
Three mechanisms account for the 92.1% number. None of them is the team metaphor.
Context isolation. Separate subprocesses with separate context windows prevent state accumulation from degrading later phases. Stoneforge's git worktree isolation targets this for a different reason -- merge conflict avoidance -- but the mechanism matters regardless of framing. A 12-phase mission where each phase starts clean maintains its quality floor across the full run. A 12-phase mission where agents share a context window does not.
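The coordination pattern is a bounded pool of isolated workers. A minimal sketch, with the production subprocess spawn (a fresh CLI process per phase) simulated by a function call and dependency ordering omitted for brevity:

```go
package main

import (
	"fmt"
	"sync"
)

// runPhases sketches the worker pool: a three-slot channel semaphore bounds
// concurrency, and each phase executes in its own isolated worker. In
// production each slot spawns a fresh CLI subprocess with a clean context
// window; here run() stands in for that spawn.
func runPhases(phases []string, run func(string) string) []string {
	sem := make(chan struct{}, 3) // three-worker concurrency semaphore
	results := make([]string, len(phases))
	var wg sync.WaitGroup
	for i, p := range phases {
		wg.Add(1)
		go func(i int, p string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a worker slot
			defer func() { <-sem }() // release on completion
			results[i] = run(p)      // fresh, isolated execution
		}(i, p)
	}
	wg.Wait()
	return results
}

func main() {
	out := runPhases([]string{"research", "design", "implement", "review"},
		func(p string) string { return p + ": done" })
	fmt.Println(out)
}
```

Because each phase starts from a clean slate, a failure or context degradation in one worker cannot leak into the next.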
Deterministic handoffs. The Go binary makes routing decisions without LLM calls: keyword scores, persona catalog, dependency graph. Decomposition calls Opus once; everything after that is deterministic. When a phase fails, the transitive dependency skip -- all downstream phases that can no longer receive valid input -- is computed, not inferred. The 61 retries across 279 missions (0.22 per mission) happened in the work layer, not the coordination layer. The coordinator does not hallucinate.
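The transitive skip is a fixpoint computation over the dependency graph, not an LLM judgment. A sketch, with the graph represented as child-to-parents edges (the representation is my assumption; the mechanism is from the source):

```go
package main

import "fmt"

// skipSet computes the transitive downstream closure of a failed phase:
// every phase that depends, directly or through intermediaries, on the
// failure can no longer receive valid input, so it is skipped -- computed,
// not inferred. deps maps each phase to its parents.
func skipSet(deps map[string][]string, failed string) map[string]bool {
	skipped := map[string]bool{}
	changed := true
	for changed { // iterate to a fixpoint so chains of any depth propagate
		changed = false
		for phase, parents := range deps {
			if skipped[phase] {
				continue
			}
			for _, p := range parents {
				if p == failed || skipped[p] {
					skipped[phase] = true
					changed = true
					break
				}
			}
		}
	}
	return skipped
}

func main() {
	deps := map[string][]string{
		"design":    {"research"},
		"implement": {"design"},
		"review":    {"implement"},
		"docs":      {"research"},
	}
	// "design" fails: implement and review are skipped; docs still runs.
	fmt.Println(skipSet(deps, "design"))
}
```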
Learnings feedback. 5,130 learnings retrieved across 279 missions fed back into subsequent phases via semantic search. This is the one place Gemini runs in production: gemini-embedding-001 generating 3,072-dimension embeddings for retrieval. Not as an execution model, as an embedding model. The learnings system is what makes mission 150 more reliable than mission 15. The system knows which approaches failed, which personas work for specific task types, which library versions broke. Each mission's output feeds the next mission's context.
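Retrieval over stored learnings reduces to similarity ranking over embeddings. A sketch with toy 3-dimensional vectors standing in for the 3,072-dimension production embeddings (the store contents and helper names are illustrative):

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// cosine returns the cosine similarity of two equal-length embeddings.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// topLearnings sketches semantic retrieval: rank stored learnings by
// similarity to the new phase's embedded objective and return the best k.
func topLearnings(query []float64, store map[string][]float64, k int) []string {
	type scored struct {
		text string
		sim  float64
	}
	var all []scored
	for text, vec := range store {
		all = append(all, scored{text, cosine(query, vec)})
	}
	sort.Slice(all, func(i, j int) bool { return all[i].sim > all[j].sim })
	if k > len(all) {
		k = len(all)
	}
	out := make([]string, k)
	for i := 0; i < k; i++ {
		out[i] = all[i].text
	}
	return out
}

func main() {
	store := map[string][]float64{
		"pin library X to v2":         {0.9, 0.1, 0.0},
		"researcher persona fits surveys": {0.1, 0.9, 0.0},
	}
	fmt.Println(topLearnings([]float64{1, 0, 0}, store, 1))
}
```

The retrieval layer is why the embedding model choice is decoupled from the execution model: Gemini embeds, Claude executes, and the two never need a shared quality gate.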
None of these require agents to be named Jordan or Alex. None require a contract-signing workflow. None require a director agent to decompose goals through LLM reasoning. They require process isolation, a deterministic scheduler, and a feedback loop that persists beyond the session.
The Problem That Is Still Open
Stoneforge's Steward component is the closest answer in the current wave. Stewards handle merge conflicts, documentation, and recovery -- quality gates with specialized roles and structured triggers. But the Steward's quality judgment is itself an LLM call. An LLM evaluating software correctness is doing the hardest thing in software engineering. The Steward is as fallible as any other worker in the system.
VibeHQ's PM agent is supposed to catch bad output through its own LLM reasoning. The probability it catches every failure is bounded by the same ceiling that limits any individual agent.
The frameworks shipping this week have solved coordination -- idle-aware queuing, context window continuity, dependency-aware dispatch. These are genuine contributions. What none of them has solved is accountability: who evaluates output quality, what happens when that evaluator is wrong, and how the system recovers without human intervention.
MetaGPT's role ablation result is real. Structured roles with defined handoffs do reduce cascading errors in greenfield code generation. The mechanism is not the team metaphor. It is the structured handoff. You could capture the same benefit with a pipeline that has no human names, no contracts, no PM agents -- just phases, personas, and deterministic context passing between them.
The team metaphor makes multi-agent systems legible. That is valuable. The 279-mission production history suggests legibility and performance are different optimization targets, and that the current wave is optimizing for the wrong one.