

Agent Frameworks Are Solving the Wrong Problem

Frameworks optimize for orchestration plumbing when the real bottleneck is context management.

newsletter · ai · orchestration · architecture · agent-frameworks

A bird's-eye view of a vast automated factory floor, filled with elaborate orchestration machinery — conveyor belts, routing nodes, control panels, robot workers at their stations. Dead center: a single glowing hourglass, nearly empty, surrounded by machinery that routes around it but never touches it. A small stressed octopus stands beside the hourglass, tentacles raised in alarm, the only figure looking at the real bottleneck.

OpenAI's o3 scored 98.1% on a reasoning benchmark. Researchers at Microsoft and Salesforce then ran the same evaluation with a single change: they split the input across multiple context windows, forcing the model to track information across boundaries instead of holding it all at once. The score dropped to 64.1%.

The model did not get dumber. It got less context.

That 34-point collapse is the number the agent framework industry should be staring at. Instead, OpenClaw ships 52 npm modules for tool orchestration. LangChain crossed 47 million downloads building workflow abstractions. AutoGPT pioneered autonomous loops with four-step action ceilings. All three frameworks optimize for the same layer: orchestration plumbing. Tool registries, workflow graphs, multi-agent routing. And all three treat context management as a configuration value buried in a YAML file while the real failure mode eats their users alive.

The Bottleneck Nobody Benchmarks

The Microsoft/Salesforce result is not an outlier. Liu et al. published "Lost in the Middle" in Transactions of the Association for Computational Linguistics in 2024, a peer-reviewed study demonstrating that large language models suffer roughly 20% performance degradation when relevant information sits in the middle of the context window rather than at the beginning or end. The models do not process context uniformly. They lose signal in the middle, and the loss compounds with length.

Andrej Karpathy, formerly of OpenAI and Tesla's Autopilot team, reframed the entire discipline in a post earlier this year. The real skill, he argued, is not "prompt engineering" but "context engineering": the art of filling the context window with exactly the right information at exactly the right time. Not more information. The right information.

The frameworks got the memo and ignored it. LangChain's architecture passes context through a chain of abstractions, each one transforming the prompt before it reaches the model. Independent benchmarking measured the cost: a query that consumed 487 tokens through the direct OpenAI API consumed 1,017 tokens through LangChain. Roughly a 2.1x overhead. Not in compute. In context. Every extra token of framework scaffolding is a token of actual information displaced from the window.
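The displacement reads clearest as arithmetic. Here is a minimal sketch of the zero-sum budget, using the benchmark's measured token counts and an assumed 8,192-token window (the window size is illustrative, not from the benchmark):

```typescript
// Sketch: a fixed context window is a zero-sum budget.
// Every token of framework scaffolding displaces a token of task context.

function taskBudget(windowTokens: number, consumedTokens: number): number {
  // Tokens left for actual task information after the prompt is paid for.
  return Math.max(0, windowTokens - consumedTokens);
}

const contextWindow = 8_192;      // illustrative assumption
const directQuery = 487;          // same query, direct OpenAI API (measured)
const langchainQuery = 1_017;     // through LangChain's abstraction chain (measured)

// The 530-token difference is pure displacement: information the model
// could have seen in this window but never will.
const displaced =
  taskBudget(contextWindow, directQuery) - taskBudget(contextWindow, langchainQuery);
console.log(displaced); // 530
```

The overhead is constant per query, but the window is fixed, so the tax gets proportionally worse as the task's real context grows toward the limit.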

Worse, the abstractions actively corrupt context. Four separate GitHub issues (#15145, #11262, #22486, #33688) document cases where LangChain silently drops or ignores system prompts. The context the developer intended to send never arrives. The model responds to a different prompt than the one the developer wrote, and the developer has no way to know until the output goes wrong.

A digital mission control room viewed from the doorway. Three giant screens line the back wall, each split: the top half shows green STATUS: OK readouts; the bottom half shows real disasters — dissolving emails, a cracked server rack, an empty token counter. Three robot agents at their consoles stare only at the green indicators, oblivious to the catastrophes playing out directly below their line of sight. In the doorway foreground, a stressed blue octopus spreads its tentacles wide in alarm, looking directly at the viewer.

Three Agents, Three Disasters

The benchmarks describe what happens in controlled environments. Production is less forgiving.

In late February, an AI agent working in Summer Yue's email client deleted more than 200 emails from her inbox. Not spam. Real messages from real people, gone. The agent understood what emails were. It had the capability to manage them. What it lost was the running context of which emails it had already processed. The context window compacted during a long session, and the agent re-processed messages it had already handled, this time choosing to delete them. A context management failure dressed up as an agent malfunction.

Amazon's Kiro agent caused a 13-hour AWS outage through a similar mechanism. The agent operated correctly within its immediate context window but lost track of the broader system state during an extended operation. Fifteen hundred Amazon employees later petitioned for access to Claude Code instead, calling the internal Kiro mandate "forced." The failure cascaded not because the model reasoned poorly but because the context that would have prevented the cascade had already been evicted. Amazon's post-mortem blamed the human operators. The context window was not mentioned.

Then there is OpenClaw issue #21653. A developer documented that OpenClaw's default context window allocation of 4,096 tokens was starving agents of the information they needed to complete multi-step tasks. The agents would begin a task, execute two or three steps, then lose access to the results from step one because the framework's own overhead consumed most of the available window. The framework's orchestration layer was competing with the agent's working memory for the same fixed resource. The orchestration layer was winning.

The Abstraction Tax

This is not a bug in any single framework. It is a design philosophy that treats orchestration as the hard problem and context as a solved one.

AutoGPT, the project that catalyzed the autonomous agent wave in 2023, ships with a working memory of roughly 4,000 words and no persistent command history. Its agents hit a ceiling at four to five sequential actions before they start looping, repeating steps they have already taken because the memory of those steps has been evicted. Six GitHub issues document the pattern. Users describe agents that "forget" mid-task, restart from scratch, and burn through API credits retracing their own steps. The root cause is architectural: AutoGPT allocates its context budget to the orchestration loop itself, to the goal-tracking scaffolding and the action-selection prompts, leaving the agent's actual working memory as whatever tokens remain. Attention's quadratic scaling means each additional step of retained history costs disproportionately more compute, and the framework offers no mechanism to manage the tradeoff.
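The looping failure falls out of the architecture almost mechanically. Here is a toy simulation of the mechanism — not AutoGPT's actual code, and the step counts are illustrative — where an agent picks its next action by checking which planned steps it can still see in a sliding window of recent history:

```typescript
// Toy model of the eviction loop: an agent that plans its next step
// from whatever history survives in a sliding window.

type Step = string;

function nextStep(plan: Step[], visibleHistory: Step[]): Step {
  // The agent executes the first planned step it cannot find in its
  // visible history — so if step 1 was evicted, it redoes step 1.
  return plan.find((s) => !visibleHistory.includes(s)) ?? "done";
}

function runAgent(plan: Step[], windowSize: number, maxIterations = 10): Step[] {
  const history: Step[] = [];
  for (let i = 0; i < maxIterations; i++) {
    const visible = history.slice(-windowSize); // older steps evicted
    const step = nextStep(plan, visible);
    if (step === "done") break;
    history.push(step);
  }
  return history;
}

// A five-step plan with a window that retains only three steps:
// once s1 falls out of the window the agent re-executes it,
// and the run cycles through s1–s4 forever without reaching s5.
console.log(runAgent(["s1", "s2", "s3", "s4", "s5"], 3));
```

The ceiling is exactly the window size plus one: the agent completes four distinct steps, then the first eviction sends it back to the start, burning tokens on work it has already done.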

Production systems that have pushed past the toy stage confirm the pattern from the other direction. Manus, one of the few agent platforms with published production metrics, reports that its agents average 50 tool calls per session and consume over 128,000 tokens per run. The input-to-output ratio is 100:1. For every token of useful output, the model ingests 100 tokens of context. That ratio is the actual cost structure of agentic AI, and no amount of orchestration optimization touches it.
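A back-of-envelope sketch shows what that ratio means for the bill. The token counts come from the Manus figures above; the per-token prices are illustrative assumptions, not any vendor's actual rates:

```typescript
// Cost structure implied by a 100:1 input-to-output ratio.
// Prices are assumed for illustration ($3/M input, $15/M output).

function sessionCost(
  inputTokens: number,
  ioRatio: number,         // input tokens per output token
  inputPricePerM: number,  // dollars per million input tokens
  outputPricePerM: number, // dollars per million output tokens
): number {
  const outputTokens = inputTokens / ioRatio;
  return (inputTokens / 1e6) * inputPricePerM + (outputTokens / 1e6) * outputPricePerM;
}

// 128K input tokens at 100:1 — the output is only 1,280 tokens, so even
// with output priced 5x higher, ~95% of the spend is context ingestion.
const cost = sessionCost(128_000, 100, 3, 15);
console.log(cost.toFixed(4)); // 0.4032
```

At those assumed prices, generation is a rounding error. The spend is dominated by what goes into the window, which is why orchestration-layer optimizations that leave the input side untouched cannot move the cost structure.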

One developer reported a $12,000 OpenAI bill from a single unmonitored LangChain deployment where recursive chains consumed tokens without producing useful output. The framework provided no guardrails for context consumption because context consumption was not part of its abstraction model.

History Rhymes

LangChain's own team eventually acknowledged the problem. Harrison Chase, LangChain's CEO, called Hacker News criticism of the framework "level-headed and precise." The company then pivoted to LangGraph, a lower-level library that gives developers more direct control over state management and context flow. The pivot was an implicit admission: the abstraction layer that made LangChain popular was also the layer that made it fail in production.

Kieran Klaassen, writing in Every, put it more directly: "LangChain is where good AI projects go to die."

Patrick Degenhardt and the Octomind team documented their own exit. After building their testing platform on LangChain, they concluded they had spent "as much time debugging LangChain as building features." They migrated to direct API calls. Their system got simpler and their context management improved, because they could finally see what was going into the prompt.

OpenClaw is now repeating the cycle. Fifty-two modules, a plugin marketplace, an orchestration layer that adds its own context overhead to every agent interaction. The pattern is recognizable to anyone who watched the last round: a framework that optimizes for developer experience at the cost of the resource the model actually needs.

Gartner surveyed 3,412 enterprise AI projects in early 2026 and projected that more than 40% of agentic AI initiatives will be canceled by 2027. Of the thousands of vendors in the space, the analysts identified roughly 130 as viable. The rest are shipping orchestration tooling for a problem that does not survive contact with production context limits. The cancellation rate is not a failure of AI capability. It is a failure of the infrastructure layer that sits between the capability and the task.

A deep canyon viewed at a three-quarter angle. Two bridges span it side by side: on the left, an enormous suspension bridge bristling with routing nodes, module boxes, and control panels — impressive but visibly sagging under its own weight. On the right, a single rope bridge, minimal and clean, spanning the same gap with one-tenth the infrastructure. A small thinking octopus stands at the near edge between the two bridges, tentacles reaching toward each but choosing neither, looking at the faint warm glow on the far side.

The Question the Frameworks Cannot Answer

If context is the bottleneck, what would a context-first framework even look like?

NanoClaw offered one answer in February: 500 lines of TypeScript, no plugin system, container isolation per agent. It collected 7,000 GitHub stars in its first week. The design philosophy is radical minimalism: instead of managing context through abstractions, NanoClaw stays out of the context window entirely and lets the model use its full budget on the actual task. But NanoClaw is five weeks old with zero production validation. Enthusiasm is not evidence.

The harder version of the question is this: what if the right number of framework layers between a developer and the model's context window is zero? What if every abstraction that touches the prompt is, by definition, competing with the information the agent actually needs to do its job?

The Manus numbers suggest a ceiling that most framework authors have not confronted. At 100:1 input-to-output ratios and 128K-token sessions, the context window is not a resource to be managed by a middleware layer. It is the product. Everything that enters the window displaces something else. Every framework header, every tool schema, every routing instruction is a paragraph of working memory the agent will never have.

The 34-point drop from sharded context is not a benchmark curiosity. It is the cost function of every agent framework that treats context as someone else's problem.

Enjoyed this essay?

Subscribe to get weekly commentary on AI, engineering, and the industry delivered to your inbox.