The Moment I Checked My Rate Limits
Three research agents into a multi-phase mission, Claude Code hit a rate limit. The mission stalled. Two implementation phases sat waiting for research that was stuck behind a "please wait" message.
The research itself was simple: scan documentation, gather context, summarize findings. Work that doesn't require Claude's reasoning power. But I was routing everything through Claude because it was easy. One CLI, one subscription, one mental model.
That stall cost me an hour. Not because the research was hard, but because I'd used my Claude capacity on work that didn't need it.
The Actual Economics
Here's what people get wrong about multi-LLM routing: it's not about per-token cost comparisons. I use Claude Code on a subscription. I'm not paying per token. Gemini CLI runs on an API key with a generous free tier. I'm not paying per token there either.
The real economics are about capacity allocation:
- Claude Code: Fixed subscription. Finite rate limits per hour. Every task I send burns capacity whether it's a complex architecture decision or a simple "scan these 50 files."
- Gemini CLI: Free tier API. Effectively unlimited for the volumes I use. One million token context window.
When I route research to Gemini, I'm not "saving money." I'm preserving Claude's rate limits for the tasks that actually need Claude's capabilities. The research still gets done — often better, because Gemini's 1M context window means it can ingest an entire codebase without chunking.
What Each Model Actually Does Well
The routing isn't arbitrary. Each provider has genuine strengths that the other lacks.
Gemini excels at breadth. Scanning 100 files. Summarizing documentation across an entire repository. Gathering competitive intelligence on a topic. Research phases where you need coverage, not precision. And it has a 1M token context window — five times what Claude offers. For research that involves reading large codebases or long documents, Gemini isn't just cheaper. It's better.
Claude excels at precision. Writing code that compiles on the first try. Making targeted edits to existing files without breaking surrounding context. Following complex multi-step instructions exactly. And critically: Claude Code can create and modify files on disk. Gemini CLI cannot. This isn't a soft preference — it's a hard constraint. Implementation tasks must go to Claude.
But here's the thing I didn't expect. The constraint that Gemini can't write files turned out to be a feature, not a limitation. Research agents that can't modify code are inherently safe. They can explore freely, scan everything, generate reports — and the worst they can do is produce bad analysis that a Claude implementation agent will ignore. The read-only constraint makes Gemini the perfect researcher.
The Routing Logic
The router lives in `internal/router/`. It classifies tasks by keyword matching and routes them to the right provider:
| Task Type | Routes To | Why |
|---|---|---|
| Research, exploration | Gemini | Free, 1M context, read-only is fine |
| Creative, reasoning | Gemini | Free tier handles it, preserves Claude limits |
| Implementation, fixes | Claude Sonnet | Needs file write access, precise edits |
| Code review, analysis | Claude Sonnet | Needs to understand code structure deeply |
| Security, architecture | Claude Opus | Highest stakes, needs best judgment |
The classifier is keyword-based. "Research authentication patterns" has "research" in it → Gemini. "Implement OAuth provider" has "implement" in it → Claude. It doesn't need to be sophisticated because the categories are broad enough that vocabulary gets it right most of the time.
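A minimal sketch of what that classifier could look like in Go. The names (`Provider`, `Classify`), the keyword lists, and the ordering are illustrative assumptions, not the actual `internal/router/` code:

```go
package router

import "strings"

// Provider identifies which CLI a task is routed to.
type Provider string

const (
	Gemini       Provider = "gemini"
	ClaudeHaiku  Provider = "claude-haiku"
	ClaudeSonnet Provider = "claude-sonnet"
	ClaudeOpus   Provider = "claude-opus"
)

// keywordRoutes is checked in order, so higher-stakes categories win
// when a task matches more than one.
var keywordRoutes = []struct {
	keywords []string
	provider Provider
}{
	{[]string{"security", "architecture"}, ClaudeOpus},
	{[]string{"implement", "fix", "refactor"}, ClaudeSonnet},
	{[]string{"review", "analyze"}, ClaudeSonnet},
	{[]string{"research", "explore", "summarize"}, Gemini},
}

// Classify picks a provider by substring match, defaulting to Claude
// Sonnet, the always-available option, when nothing matches.
func Classify(task string) Provider {
	t := strings.ToLower(task)
	for _, route := range keywordRoutes {
		for _, kw := range route.keywords {
			if strings.Contains(t, kw) {
				return route.provider
			}
		}
	}
	return ClaudeSonnet
}
```

With this shape, `Classify("Research authentication patterns")` returns Gemini and `Classify("Implement OAuth provider")` returns Claude Sonnet: exactly the two examples above.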
The more interesting design is the file-creation safety net. The router checks for phrases like "create a file," "scaffold," "write to" — anything suggesting disk mutation — and forces those tasks to Claude regardless of other signals. A research task that includes "create a summary document" gets reclassified as implementation. Better to waste some Claude capacity than to send a write task to a model that can't write.
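Extending the same sketch, the safety net can be a post-classification override. The function name is hypothetical; the phrase list comes from the description above:

```go
// writeSignals are phrases that imply disk mutation. Gemini CLI cannot
// write files, so any match forces the task to Claude.
var writeSignals = []string{"create a file", "scaffold", "write to"}

// RouteWithSafetyNet overrides a Gemini route whenever the task looks
// like it needs file access, regardless of other signals.
func RouteWithSafetyNet(task string) Provider {
	provider := Classify(task)
	t := strings.ToLower(task)
	for _, sig := range writeSignals {
		if strings.Contains(t, sig) && provider == Gemini {
			// Wasting some Claude capacity beats sending a write
			// task to a model that can't write.
			return ClaudeSonnet
		}
	}
	return provider
}
```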
The Fallback Chain
Models fail. Rate limits hit. APIs go down. The router implements graceful degradation:
```
Preferred model (capability match)
  → Fallback: same-provider alternative
  → Fallback: cross-provider alternative
  → Default: Claude Sonnet (always available)
```

If Gemini is down, research falls back to Claude Haiku — fast and cheap within the subscription. If Claude Opus is rate-limited, architecture tasks fall back to Sonnet. The system degrades toward "slightly less optimal" rather than "mission failed."
This matters for multi-phase missions that run for 30+ minutes. A mission with 5 phases can't afford to die because one API returned a 429 on Phase 3. The fallback chain keeps the mission moving, even if some phases use a less optimal model than planned.
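Continuing the sketch, the chain can be an ordered fallback table plus a dispatch loop. The `run` callback standing in for an actual CLI invocation is an assumption:

```go
// fallbacks orders alternatives per preferred provider: same-provider
// first, then cross-provider, ending at the always-available Sonnet.
var fallbacks = map[Provider][]Provider{
	Gemini:      {ClaudeHaiku, ClaudeSonnet},
	ClaudeOpus:  {ClaudeSonnet},
	ClaudeHaiku: {ClaudeSonnet},
}

// Dispatch walks the chain until a provider succeeds, so a single 429
// degrades the mission to "slightly less optimal" instead of killing it.
func Dispatch(task string, run func(Provider, string) error) error {
	preferred := RouteWithSafetyNet(task)
	var lastErr error
	for _, p := range append([]Provider{preferred}, fallbacks[preferred]...) {
		if lastErr = run(p, task); lastErr == nil {
			return nil
		}
	}
	return lastErr
}
```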
There's also quota-aware routing. The system tracks how much Claude capacity has been used and, when limits are approaching, proactively shifts deferrable work to Gemini. Security tasks are exempt — they always get Claude Opus regardless of quota state. But research and exploration will automatically shift to Gemini when Claude is running hot.
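Quota awareness fits the same shape. Here is one hedged way to express it, with an assumed 80% threshold and an assumed definition of "deferrable":

```go
// QuotaTracker counts Claude usage within the current rate-limit window.
type QuotaTracker struct {
	used, limit int
}

// RunningHot reports whether Claude capacity is nearly spent. The 80%
// threshold is an illustrative choice, not the real system's value.
func (q *QuotaTracker) RunningHot() bool {
	return float64(q.used) >= 0.8*float64(q.limit)
}

// deferrable marks work that can shift to Gemini without harm; which
// categories count is an assumption of this sketch.
func deferrable(task string) bool {
	t := strings.ToLower(task)
	for _, kw := range []string{"research", "explore", "summarize"} {
		if strings.Contains(t, kw) {
			return true
		}
	}
	return false
}

// RouteWithQuota shifts deferrable work to Gemini when Claude is running
// hot. Opus-routed tasks (security, architecture) are never downgraded.
func (q *QuotaTracker) RouteWithQuota(task string) Provider {
	provider := RouteWithSafetyNet(task)
	if provider == ClaudeOpus {
		return provider
	}
	if q.RunningHot() && deferrable(task) {
		return Gemini
	}
	return provider
}
```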
What This Actually Looks Like in Practice
A typical orchestrator mission — say, "research and implement a new CLI feature" — decomposes into phases:
- Research phase → Gemini scans the codebase, reads docs, produces a findings report. Cost: $0. Claude rate limit impact: zero.
- Planning phase → Gemini synthesizes research into an implementation plan. Cost: $0. Rate limit impact: still zero.
- Implementation phase → Claude Sonnet writes the actual code, creates files, runs tests. Uses subscription capacity where it matters.
- Review phase → Claude Sonnet reviews the implementation. Worth the capacity because review quality directly affects code quality.
Two of the four phases run on Gemini's free tier, and Claude capacity is preserved for the two that genuinely need it. If I'd routed everything through Claude, the research and planning phases would have consumed capacity that the implementation phase needs.
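Written out as data, the routing for that mission looks something like this. The `Phase` type and the mission literal are hypothetical, mirroring the table above:

```go
// Phase pairs a mission step with its routed provider. This is an
// illustrative schema, not the orchestrator's real one.
type Phase struct {
	Name     string
	Provider Provider
}

var cliFeatureMission = []Phase{
	{"research", Gemini},             // free tier, zero Claude impact
	{"planning", Gemini},             // synthesis still fits the free tier
	{"implementation", ClaudeSonnet}, // needs file writes and precise edits
	{"review", ClaudeSonnet},         // review quality pays for the capacity
}
```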
The throughput difference is real. I can spawn three Gemini research agents in parallel with zero concern about rate limits. Try that with Claude and you'll hit limits fast. Parallel research on Gemini → sequential implementation on Claude → more total work done per hour.
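That fan-out pattern is easy to sketch with goroutines; `runGeminiResearch` is a hypothetical stand-in for shelling out to the Gemini CLI:

```go
package main

import (
	"fmt"
	"sync"
)

// runGeminiResearch stands in for a real Gemini CLI invocation.
func runGeminiResearch(topic string) string {
	return "findings for " + topic
}

func main() {
	topics := []string{"auth patterns", "CLI conventions", "error handling"}
	findings := make([]string, len(topics))

	var wg sync.WaitGroup
	for i, topic := range topics {
		wg.Add(1)
		go func(i int, topic string) {
			defer wg.Done()
			// Free-tier Gemini means no rate-limit budgeting here:
			// three agents fan out at zero Claude capacity cost.
			findings[i] = runGeminiResearch(topic)
		}(i, topic)
	}
	wg.Wait()

	// The reports then feed one sequential Claude implementation pass.
	for _, f := range findings {
		fmt.Println(f)
	}
}
```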
The Lesson Isn't About Cost
The original instinct — "use the cheapest model" — was wrong. The right framing is: use each model where it's strongest, and don't waste constrained resources on unconstrained work.
Gemini isn't just "cheaper Claude." It has a 1M context window that makes it genuinely better for research. Claude isn't just "expensive Gemini." It can write files, follow complex instructions precisely, and reason about code in ways that Gemini can't match. The router exists because these are different tools with different strengths, not different tiers of the same tool.
The AI industry talks about this as "model selection." I think of it as resource allocation. You have a fixed Claude budget (rate limits per hour) and an unlimited Gemini budget (free tier). The router's job is to make sure every Claude request is one that only Claude can handle. Everything else goes to Gemini. Not because Gemini is good enough. Because for research, Gemini is often better — and it's free.
Next: Starting Line: The Case for Personal AI
Related Reading
- How Multi-Agent Orchestration Works — The routing system in context of the full orchestration pipeline
- Teaching AI to Learn From Its Mistakes — How learnings make each routed task more effective over time