How I Improved an AI Agent from 40% to 60% — With A/B Test Data
Same model, same test cases, a 20-point jump in pass rate. 7 out of 8 fixes were pure code, zero LLM cost. Here's exactly what I changed and why it worked.
The Setup
I was optimizing an AI agent for a production system — a creator agent that handles user requests like “make this character fiercer” or “rename this entity.” The agent runs a 5-layer pipeline: Perceive → Cognate → Decide → Act → Express, with real LLM calls at each step.
Quality was bad. Not “it doesn’t work” bad — “it works 40% of the time” bad. The other 60% of runs failed through wrong entity targeting, infinite reasoning loops, and silent failures.
I ran 5 standardized test cases, each repeated 5 times (LLMs are non-deterministic), measuring pass rate:
| Test | What It Does | Baseline |
|---|---|---|
| QL-001 | Create 4 entities + 1 relationship in one message | 0% |
| QL-002 | Classify user intent correctly | 80% |
| QL-003 | Update the right entity in a world with 6 characters + 4 locations | 40% |
| QL-004 | Maintain context across long conversation | 100% |
| QL-005 | Simple rename (“Ember” → “Infernia”) | 20% |
Overall: 40% pass rate. The model (equivalent to GPT-4 class) was plenty capable. Something else was wrong.
The Diagnosis: Context Was the Problem
QL-003: Why the Agent Confused Entities (40% → 80%)
The user says: “Make Ember more fierce and give her fire breath.”
The world has 10 entities: 6 characters (Ember, Luna, Grak, Roland, Mira, Pip) and 4 locations. The agent’s BuildChatCompletionMessages function dumped ALL entity data into the prompt — every character’s backstory, every location’s description.
The LLM had to find Ember in a wall of irrelevant text. Sometimes it picked Luna. Sometimes it referenced the wrong character’s traits. Not because the model was stupid — because the context was noisy.
QL-005: Why Simple Rename Failed (20% → 80%)
“Rename Ember to Infernia.” One entity, one operation. Should be trivial.
Two problems:
- No round limit — the agent sometimes looped 15+ times on a rename, reasoning tools firing endlessly
- When a tool failed, the LLM got back `{"error": true, "message": "This tool is temporarily unavailable."}` with no context on what to do next
The model gave up or produced responses that didn’t contain “Infernia.”
QL-001: Why Multi-Step Creation Was Impossible (0% → 0%)
“Create a dragon named Ember who lives in Crystal Caves. Ember has a rivalry with Sir Roland who guards the village gate.”
This requires creating 4 entities + 1 relationship. The 5-layer pipeline processes entities sequentially, each in isolation. The relationship creation doesn’t know the knight was just created — there’s no shared state between action steps.
Both baseline and improved scored 0%. This is an architectural problem, not a context problem.
The Fixes: 8 Changes, 7 Pure Code
Fix 1: PlanExecution (the only LLM call)
One API call before the main loop. The LLM generates a plan:
```
Goal: Update Ember's properties
Steps:
  1. Identify Ember entity
  2. Apply personality changes
Tools needed: updateCharacter
```
This plan gets injected into the cognition layer’s context. The intent classifier now sees a roadmap, not just raw entity data.
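Injecting the plan is a one-line transformation of the message list. A minimal sketch, assuming an illustrative `Message` type and a hypothetical `injectPlan` helper (the project's actual types and names may differ):

```go
package main

import "fmt"

// Message mirrors a chat-completion message. The field names here are
// illustrative, not the project's actual types.
type Message struct {
	Role    string
	Content string
}

// injectPlan (hypothetical helper) prepends the generated plan as a system
// message so the cognition layer sees the roadmap before raw entity data.
func injectPlan(plan string, msgs []Message) []Message {
	planMsg := Message{Role: "system", Content: "Execution plan:\n" + plan}
	return append([]Message{planMsg}, msgs...)
}

func main() {
	msgs := injectPlan(
		"1. Identify Ember entity\n2. Apply personality changes",
		[]Message{{Role: "user", Content: "Make Ember fiercer"}},
	)
	fmt.Println(msgs[0].Role) // system
}
```

Keeping the plan as the first message means later context trimming is less likely to drop it.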
Cost: ~$0.003 per request, 3-5s latency. The only fix that uses an LLM call.
Fix 2: PrioritizeContext (pure code)
Sort context items by salience score. Higher-relevance items go first. Low-relevance items dropped when the token budget is exceeded.
When the user says “Make Ember fiercer,” Ember’s data gets priority. Luna’s backstory gets dropped. The LLM sees signal, not noise.
```go
sort.Slice(items, func(i, j int) bool {
	return items[i].Salience > items[j].Salience // highest relevance first
})
if len(items) > maxItems { // cap derived from the token budget
	items = items[:maxItems]
}
```
Cost: Zero. Pure sort + filter.
Fix 3: CompressContext (pure code)
Old conversation rounds get summarized extractively — find tool names, find CONCLUSION markers, truncate the rest. No LLM needed for this level of compression.
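The extractive pass fits in a few lines of Go. This is a sketch: the `TOOL:`/`CONCLUSION:` markers are assumptions about the log format, not the project's actual strings.

```go
package main

import (
	"fmt"
	"strings"
)

// compressRound keeps only the lines that carry signal (tool calls and
// conclusion markers) and drops the free-form reasoning around them.
func compressRound(round string) string {
	var kept []string
	for _, line := range strings.Split(round, "\n") {
		if strings.Contains(line, "TOOL:") || strings.Contains(line, "CONCLUSION:") {
			kept = append(kept, line)
		}
	}
	return strings.Join(kept, "\n")
}

func main() {
	round := "Thinking about Ember's personality...\n" +
		"TOOL: updateCharacter\n" +
		"More musing about fire breath...\n" +
		"CONCLUSION: rename Ember to Infernia"
	fmt.Println(compressRound(round))
}
```

A round of several thousand characters collapses to two lines while keeping exactly what a later round needs to know.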
Cost: Zero. String operations.
Fix 4: Preserve Conclusions (pure code)
When reasoning text is truncated at 4,000 characters, the truncation used to cut wherever it landed. If the LLM decided “I need to rename Ember to Infernia” in round 1 but that conclusion was at character 4,100, round 2 forgot the decision.
Fix: truncateReasoningPreservingConclusions() finds CONCLUSION/DECISION markers and keeps them even when truncating.
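A sketch of what that function might look like, assuming line-level `CONCLUSION:`/`DECISION:` markers (the marker format is an assumption):

```go
package main

import (
	"fmt"
	"strings"
)

// truncateReasoningPreservingConclusions hard-truncates reasoning text at
// limit bytes, then re-appends any conclusion/decision lines that the cut
// would otherwise have discarded.
func truncateReasoningPreservingConclusions(text string, limit int) string {
	var conclusions []string
	for _, line := range strings.Split(text, "\n") {
		if strings.HasPrefix(line, "CONCLUSION:") || strings.HasPrefix(line, "DECISION:") {
			conclusions = append(conclusions, line)
		}
	}
	if len(text) > limit {
		text = text[:limit]
	}
	for _, c := range conclusions {
		if !strings.Contains(text, c) {
			text += "\n" + c
		}
	}
	return text
}

func main() {
	long := strings.Repeat("reasoning... ", 10) + "\nDECISION: rename Ember to Infernia"
	fmt.Println(truncateReasoningPreservingConclusions(long, 50))
}
```

The decision survives even when it sat past the truncation point, so round 2 still knows what round 1 concluded.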
Cost: Zero. String search.
Fix 5: Max Rounds Cap (pure code)
```go
const DefaultMaxRounds = 10

if roundCount > DefaultMaxRounds {
	break
}
```
Previously unlimited. The agent sometimes looped 15+ rounds on a trivial task. Now it stops at 10 and produces its best result.
Cost: Zero. One if-statement.
Fix 6: Structured Tool Errors (pure code)
Before:

```json
{"error": true, "tool_name": "updateCharacter", "message": "This tool is temporarily unavailable."}
```

After:

```json
{"error": true, "tool_name": "updateCharacter",
 "message": "This tool is temporarily unavailable.",
 "error_type": "timeout", "retryable": true}
```
With retryable: true, the LLM knows to try again instead of giving up. With error_type: "timeout", it knows the issue is transient.
Cost: Zero. String classification.
Fix 7: Circuit Breaker (pure code)
Count failures per LLM provider. After 3 consecutive failures, skip that provider and try the fallback. Prevents the agent from burning through 120 seconds of timeout on a dead provider.
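A minimal sketch of that counter-plus-threshold breaker (the type and method names are illustrative; the threshold of 3 matches the text):

```go
package main

import "fmt"

// breaker counts consecutive failures per provider and refuses a provider
// once it crosses the threshold; a success resets the count.
type breaker struct {
	failures  map[string]int
	threshold int
}

func newBreaker() *breaker {
	return &breaker{failures: map[string]int{}, threshold: 3}
}

func (b *breaker) Allow(provider string) bool  { return b.failures[provider] < b.threshold }
func (b *breaker) RecordFailure(provider string) { b.failures[provider]++ }
func (b *breaker) RecordSuccess(provider string) { b.failures[provider] = 0 }

func main() {
	b := newBreaker()
	for i := 0; i < 3; i++ {
		b.RecordFailure("primary")
	}
	fmt.Println(b.Allow("primary"), b.Allow("fallback")) // false true
}
```

Production breakers usually add a cool-down so a tripped provider is retried eventually; this sketch omits that for brevity.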
Cost: Zero. Counter + threshold.
Fix 8: HTTP Client Reuse (pure code)
Store *http.Client on the provider struct, reuse across calls. Previously each call created a new client, a new TCP connection, a new TLS handshake.
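The fix is a single struct field. `http.Client` is safe for concurrent use and its transport keeps TCP/TLS connections alive between requests; the `provider` type below is illustrative:

```go
package main

import (
	"net/http"
	"time"
)

// provider holds one shared *http.Client instead of constructing a new
// client (and a new connection pool) on every call.
type provider struct {
	client *http.Client
}

func newProvider() *provider {
	return &provider{client: &http.Client{Timeout: 30 * time.Second}}
}

// call reuses the stored client, so repeated requests to the same host
// reuse the existing TCP connection and TLS session.
func (p *provider) call(req *http.Request) (*http.Response, error) {
	return p.client.Do(req)
}

func main() {
	_ = newProvider()
}
```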
Cost: Zero. Struct field.
The Results
| Test | Baseline | After Fix | Delta | What Fixed It |
|---|---|---|---|---|
| QL-001 | 0% | 0% | = | Needs pipeline architecture change |
| QL-002 | 80% | 80% | = | Already working |
| QL-003 | 40% | 80% | +40% | PrioritizeContext + PlanExecution |
| QL-004 | 100% | 100% | = | Already working |
| QL-005 | 20% | 80% | +60% | Max rounds + structured errors + conclusion preservation |
Overall: 40% → 60%. Same model. Better input.
Latency went from 26s to 43s due to the PlanExecution LLM call (~3-5s per test). The HTTP reuse and circuit breaker savings show up under concurrent load, not in a 5-test sequential run.
What Didn’t Improve — And Why
QL-001 (multi-step creation) stayed at 0%. This isn’t a context problem — it’s a pipeline architecture problem. Each entity is created in isolation with no shared state:
```
Create dragon → (no state) → Create knight → (no state) → Create relationship
                                                          ↑ doesn't know the dragon's ID
```
Fixing this requires collapsing the 5-layer pipeline into a unified agent with cross-step state — a larger architectural change, not a context fix.
The lesson: Context optimization has a ceiling. Past that ceiling, you need architecture changes. But the ceiling is higher than most people think — we still had 20 points of improvement available before hitting it.
What’s Still Missing
Three pieces of infrastructure were built but not wired:
| Component | Status | Gap |
|---|---|---|
| VerifyOutput | Logs quality issues | Doesn’t retry on failure |
| ScoreMemoryUsage | Computes relevance scores | Scores never applied to future retrieval |
| PlanExecution | Generates plan before loop | Plan not tracked during execution |
All three are open loops: the infrastructure detects problems but doesn’t act on them. Closing these loops is the next 20 points, getting from 60% to 80%+.
The Takeaway
Better input → better output. The LLM is the same.
If your agent is underperforming, check the context before blaming the model. In our case:
- 7 out of 8 fixes were pure code
- The only added LLM cost was the single planning call, at ~$0.003 per request
- A 20-point quality improvement without changing the model
- The model was always capable — the context was holding it back
The highest-ROI investment in any agent system is context management. It’s not glamorous. It’s sort, filter, compress, truncate, prioritize. But it’s the difference between 40% and 60% — and the foundation for everything else.
Part of the AI Agent Architecture series. See also: The 90% Problem for the broader framework, and Claude Code Deep Dive Part 3 for how Anthropic solves context at scale.