The 90% Problem: Why Most AI Agents Are Still Broken

Building an AI agent that works is easy. Building one that doesn't break is 90% of the work. Here's what that 90% actually looks like — from leaked source code and production A/B data.

April 17, 2026

Harrison Guo

9 min read

The 90% Problem

Your Agent Works Great. Until It Doesn’t.

You built an AI agent over the weekend. It calls tools, remembers context, follows instructions. You demo it to your team. Everyone’s impressed.

Monday morning, a user types “rename Ember to Infernia.” Your agent loops 15 times, burns through your API budget, and returns a response that doesn’t contain the word “Infernia.” A rename. One entity. One operation.

I’ve been there. I ran an eval suite on a production agent — 5 test cases, 5 runs each. Pass rate: 40%. Not on hard tasks. On things like “update the right character out of six” and “rename one entity.” The model was GPT-4 class. Plenty capable. The problem was everything around the model.

This is the 90% problem:

Building the core loop (perceive → reason → act):  10% of the work
Making it not break in production:                  90% of the work

It took me a while to see where the problem actually was. The real gap wasn’t missing features. It was open loops: verification that doesn’t retry, memory that doesn’t decay, compression that doesn’t circuit-break.

Here’s what I found — from analyzing Claude Code’s leaked source, where a 1,729-line query.ts file contains a 1,421-line while(true) loop inside a roughly 512,000-line codebase, and from fixing a production agent’s pass rate with code changes alone. No model upgrade. No prompt magic. Just engineering.

The Five Pillars — And Where Agents Actually Fail

Every production agent needs five things. Most only build two of them well.

Pillar	What It Does	Who Does the Work	Most Agents’ Status
Context Management	What the LLM sees	Code orchestrates; LLM helps compress	Dump everything, hope for the best
Memory Management	What persists across sessions	Code orchestrates; LLM helps recall	Basic store/retrieve, no lifecycle
Reflection	Agent checks its own output	Code triggers; LLM judges	Not implemented or logs-only
Planning	Agent thinks before acting	LLM (decompose tasks, sequence steps)	Decent — LLMs are good at this
Tool Use	Agent interacts with the world	LLM selects, Code executes	Decent — most mature pillar

Pillar Context Management

What It Does What the LLM sees

Who Does the Work Code orchestrates; LLM helps compress

Most Agents’ Status Dump everything, hope for the best

Pillar Memory Management

What It Does What persists across sessions

Who Does the Work Code orchestrates; LLM helps recall

Most Agents’ Status Basic store/retrieve, no lifecycle

Pillar Reflection

What It Does Agent checks its own output

Who Does the Work Code triggers; LLM judges

Most Agents’ Status Not implemented or logs-only

Pillar Planning

What It Does Agent thinks before acting

Who Does the Work LLM (decompose tasks, sequence steps)

Most Agents’ Status Decent — LLMs are good at this

Pillar Tool Use

What It Does Agent interacts with the world

Who Does the Work LLM selects, Code executes

Most Agents’ Status Decent — most mature pillar

The Five Pillars of Agent Architecture

Planning and Tool Use work reasonably well because they ride on model improvements. GPT-3.5 struggled with tool calling; Claude Opus 4.6 is reliable. You get these improvements for free with model upgrades.

Context and Memory are where agents fail because they’re engineering problems, not model problems. Reflection sits in the middle: the LLM can judge quality, but code still has to trigger that check, route the result, and do something with it. No model upgrade will fix a context pipeline that dumps 10 irrelevant entities into the prompt.

The LLM vs Code Divide

This is the most important insight for anyone building agents:

HIGH LLM dependence (improves with better models):
  Planning        → LLM generates the plan
  Reflection      → LLM evaluates quality
  Tool Selection  → LLM picks the right tool

LOW LLM dependence (never improves from model upgrades):
  Context Management  → Code sorts, filters, compresses
  Memory Management   → Code stores, retrieves, scores, decays
  Error Handling      → Code classifies errors, retries, circuit-breaks
  Tool Execution      → Code runs tools, parallelizes, batches
  State Management    → Code tracks progress, checkpoints

LLM vs Code Divide

But low LLM dependence does not mean zero model calls. It means the failure mode is mostly in the orchestration. Even code-dominated pillars still use models in three very different ways:

Three Ways Agents Call LLMs

1. Direct LLM call (sideQuery) — Code asks a narrow question, takes the answer, and moves on. Example: Claude Code’s memory recall uses a single Sonnet side-query to choose 5 relevant memories from roughly 200 files.

2. Forked sub-agent — Code delegates a bounded task to a child agent with its own context, tools, and loop. Example: Claude Code’s autocompact hands summarization to a child agent instead of forcing the main loop to do it inline.

3. Tool-use loop — The LLM decides which tool to call, the program executes it, and the result flows back into the next turn. This is the main agent loop.

Simple question (which memories are relevant?)  → Direct call
Complex but bounded task (summarize this)       → Forked sub-agent
Open-ended execution (build this feature)       → Tool-use loop

This choice is not academic. It changes latency, token cost, and failure modes. In Claude Code’s memory system, a side-query is cheap. A forked summarizer is much heavier. Using the wrong pattern wastes budget or hurts reliability.

The trap: Teams chase model upgrades (“let’s switch to Claude Opus”) instead of fixing their context pipeline. Better models help — but in my experience, fixing the context pipeline delivers a larger improvement per dollar than upgrading the model.

In one production system, fixing context management alone — without changing the model — moved quality from 40% to 60%. Seven out of eight fixes were pure code, zero LLM cost. The model was always capable. The context was holding it back.

What 90% Actually Looks Like — From Claude Code’s Source

The Anatomy of Production: Claude Code

Claude Code’s leaked source is roughly 512,000 lines. Here’s the useful way to think about that split:

query.ts orchestrator file:             1,729 lines    (~0.3%)
Core while(true) loop inside it:        1,421 lines
Everything else:                     ~510,000 lines    (~99.7%)

That “everything else” is the 90%:

Context Management (3,960 lines in src/services/compact/):

5-level progressive compression pipeline
Microcompact with dual code paths based on cache state
Token estimation without API calls (<5% error)
Post-compression recovery (restore last 5 files, skills, agent state)
Circuit breaker: 3 consecutive failures → stop (after 250K API calls/day were wasted without it)

Memory System (1,736 lines in src/memdir/):

4-type closed taxonomy with structured frontmatter
Sonnet side-query for semantic retrieval (250ms, async prefetch)
Background extraction agent with mutual exclusion
Trust verification (eval went 0/2 → 3/3 with this addition)

Error Handling (spread across entire codebase):

Message normalization: fix orphan tool_use/tool_result pairs from crashes
Prompt-Too-Long recovery: reactive compression as last resort
Tool failure classification: timeout vs permission vs not-found
Max output token escalation: 8K default → 64K on truncation

Permission System (multi-layer):

Tool-level risk classification
User confirmation for dangerous operations
Sandbox isolation for high-risk tools
Context injection scanning

None of this is intellectually exciting. It’s plumbing. But without it, the “exciting” part — the agent loop — crashes on every non-trivial conversation.

OpenClaw and Hermes Surface the Same Pattern

Two open-source agents worth watching right now — OpenClaw and Hermes Agent — illustrate the same architectural lesson.

OpenClaw:

Context management: still basic; I don’t see the kind of progressive compression Claude Code built
Memory: Markdown + SQLite (more sophisticated than Claude Code’s storage layer)
Reflection: limited; I don’t yet see a strong closed verification loop
Security: public reports in early 2026 highlighted exposed instances and malicious marketplace skills; openclaw security audit exists, but tools alone don’t close the operational loop

Hermes Agent:

Context management: still basic
Memory: SQLite + full-text search + MEMORY.md dual-layer
Reflection: self-evolving skills generated from completed tasks
Error handling: layered on paper, but still early
Maturity: promising, but I haven’t seen evidence yet that the self-iteration loop holds up at scale

Both can complete tasks. My point is not that they don’t work. It’s that the hardest production loops — compression, failure accounting, verification retries, and memory hygiene — still look only partially closed. The features exist; the loops aren’t closed.

Features Exist vs Loops Are Closed

“Features Exist” vs “Loops Are Closed”

This is the most overlooked distinction in agent architecture:

Open loop:   Build a verification step → log issues → done
Closed loop: Build a verification step → log issues → retry with feedback → verify again

Open loop:   Score memory relevance → store the score → done  
Closed loop: Score memory relevance → reinforce high-scoring memories → decay low-scoring → improve retrieval over time

Open loop:   Detect compression failure → log it → continue
Closed loop: Detect compression failure → count consecutive failures → circuit-break after 3

In every open-source agent I’ve analyzed so far, some of the critical loops are still open. The infrastructure is there. The wiring is only partially connected.

Here’s the test: look at your agent’s verification step. Does it log a failure and move on? That’s an open loop. Does it log, retry with the failure as feedback, and verify again? That’s closed. The difference is one if statement and a retry call — but it’s the difference between “we have quality checks” and “we actually catch errors before users see them.”

This is the hardest 10% of the 90%. Not building the infrastructure — connecting it.

The Proof: 40% to 60%

The Proof: 40% to 60% With Code Alone

I ran A/B evals on a production agent. Same model, same test cases, different code. Result: 40% → 60% pass rate.

The breakdown: 8 fixes total. 7 were pure code — zero LLM cost. Context prioritization, structured error classification, round limits, conclusion preservation during truncation, circuit breakers. The only fix that used an LLM call was a pre-loop planning step at $0.003 per request.

The model was always capable. The context was holding it back.

(Full case study with per-test breakdown: How I Improved an AI Agent from 40% to 60%)

What This Means for Builders

If you’re building an AI agent:

Don’t start with the model. Start with context management. Clean, prioritized, bounded input is the highest-leverage investment you can make.
Close your loops. If you built a verification step, make it retry. If you built memory scoring, wire the reinforcement. Half-built infrastructure is worse than none — it gives false confidence.
Measure before you upgrade. Before switching to a more expensive model, run an eval suite on your current one. The bottleneck is probably context, not capability.
Budget 90% of your time for the 90%. The agent loop is a weekend project. Error handling, compression, memory lifecycle, permission systems — that’s the real work. Plan accordingly.

The model is a commodity. The engineering around it is the product.

Ask yourself: what percentage of your agent’s codebase is the core loop, and what percentage is everything else? If you don’t know the answer, that’s where to start.

Diagrams from this essay packaged as a single-file reference: Engineering Reliable Agents (PDF).

Part of the AI Agent Architecture series.

Deep dives into the 90%:

Claude Code Part 3: The 5-Level Compression Pipeline — how Anthropic solved context management
Claude Code Part 4: Why Markdown Instead of Vector DBs — first-principles memory tradeoffs
40% to 60% With A/B Data — the full case study behind the numbers in this article

🎧 More Ways to Consume This Content

AI Operator Deep Dive Podcast

I occasionally advise small teams on backend reliability, Go performance, and production AI systems. Learn more: /services

Comments

This space is waiting for your voice.

Comments will be supported shortly. Stay connected for updates!

Preview of future curated comments

This section will display user comments from various platforms like X, Reddit, YouTube, and more. Comments will be curated for quality and relevance.

The 90% Problem: Why Most AI Agents Are Still Broken

Building an AI agent that works is easy. Building one that doesn't break is 90% of the work. Here's what that 90% actually looks like — from leaked source code and production A/B data.

Table of Contents

Your Agent Works Great. Until It Doesn’t.

The Five Pillars — And Where Agents Actually Fail

The LLM vs Code Divide

What 90% Actually Looks Like — From Claude Code’s Source

OpenClaw and Hermes Surface the Same Pattern

“Features Exist” vs “Loops Are Closed”

The Proof: 40% to 60% With Code Alone

What This Means for Builders

🎧 More Ways to Consume This Content

Comments

Leave a Comment

The 90% Problem: Why Most AI Agents Are Still Broken

Building an AI agent that works is easy. Building one that doesn't break is 90% of the work. Here's what that 90% actually looks like — from leaked source code and production A/B data.

Table of Contents

Your Agent Works Great. Until It Doesn’t.

The Five Pillars — And Where Agents Actually Fail

The LLM vs Code Divide

What 90% Actually Looks Like — From Claude Code’s Source

OpenClaw and Hermes Surface the Same Pattern

“Features Exist” vs “Loops Are Closed”

The Proof: 40% to 60% With Code Alone

What This Means for Builders

🎧 More Ways to Consume This Content

[ Agent_Architecture_Notes ]

Related Articles

Claude Code Deep Dive Part 4: Why It Uses Markdown Files Instead of Vector DBs

Claude Code Deep Dive Part 2: The 1,421-Line While Loop That Runs Everything

Claude Code MEMORY.md Spec: The 4 Frontmatter Types Decoded (user / feedback / project / reference)

Comments

Leave a Comment

[ Connect_With_Me ]