How I Improved an AI Agent from 40% to 60% — With A/B Test Data
Same model, same test cases, a 20-point jump in pass rate. 7 out of 8 fixes were pure code, zero LLM cost. Here's exactly what I changed and why it worked.
The Setup
I was optimizing an AI agent for a production system — a creator agent that handles user requests like “make this character fiercer” or “rename this entity.” The agent runs a 5-layer pipeline: Perceive → Cognate → Decide → Act → Express, with real LLM calls at each step.
Quality was bad. Not “it doesn’t work” bad — “it works 40% of the time” bad. The other 60% of runs failed through wrong entity targeting, infinite reasoning loops, and silent failures.
I ran 5 standardized test cases, each repeated 5 times (LLMs are non-deterministic), measuring pass rate:
| Test | What It Does | Baseline |
|---|---|---|
| QL-001 | Create 4 entities + 1 relationship in one message | 0% |
| QL-002 | Classify user intent correctly | 80% |
| QL-003 | Update the right entity in a world with 6 characters + 4 locations | 40% |
| QL-004 | Maintain context across long conversation | 100% |
| QL-005 | Simple rename (“Ember” → “Infernia”) | 20% |
Overall: 40% pass rate. The model (equivalent to GPT-4 class) was plenty capable. Something else was wrong.
The Diagnosis: Context Was the Problem
QL-003: Why the Agent Confused Entities (40% → 80%)
The user says: “Make Ember more fierce and give her fire breath.”
The world has 10 entities: 6 characters (Ember, Luna, Grak, Roland, Mira, Pip) and 4 locations. The agent’s BuildChatCompletionMessages function dumped ALL entity data into the prompt — every character’s backstory, every location’s description.
The LLM had to find Ember in a wall of irrelevant text. Sometimes it picked Luna. Sometimes it referenced the wrong character’s traits. Not because the model was stupid — because the context was noisy.
QL-005: Why Simple Rename Failed (20% → 80%)
“Rename Ember to Infernia.” One entity, one operation. Should be trivial.
Two problems:
- No round limit — the agent sometimes looped 15+ times on a rename, reasoning tools firing endlessly
- When a tool failed, the LLM got back `{"error": true, "message": "This tool is temporarily unavailable."}` with no context on what to do next
The model gave up or produced responses that didn’t contain “Infernia.”
QL-001: Why Multi-Step Creation Was Impossible (0% → 0%)
“Create a dragon named Ember who lives in Crystal Caves. Ember has a rivalry with Sir Roland who guards the village gate.”
This requires creating 4 entities + 1 relationship. The 5-layer pipeline processes entities sequentially, each in isolation. The relationship creation doesn’t know the knight was just created — there’s no shared state between action steps.
Both baseline and improved scored 0%. This is an architectural problem, not a context problem.
The Fixes: 8 Changes, 7 Pure Code
Fix 1: PlanExecution (the only LLM call)
One API call before the main loop. The LLM generates a plan:
```
Goal: Update Ember's properties
Steps:
  1. Identify Ember entity
  2. Apply personality changes
Tools needed: updateCharacter
```
This plan gets injected into the cognition layer’s context. The intent classifier now sees a roadmap, not just raw entity data.
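Injecting the plan is a one-line transformation of the message list. A minimal sketch, assuming an illustrative `Message` type and a hypothetical `injectPlan` helper (the project's actual types and names may differ):

```go
package main

import "fmt"

// Message mirrors a chat-completion message. The field names here are
// illustrative, not the project's actual types.
type Message struct {
	Role    string
	Content string
}

// injectPlan (hypothetical helper) prepends the generated plan as a system
// message so the cognition layer sees the roadmap before raw entity data.
func injectPlan(plan string, msgs []Message) []Message {
	planMsg := Message{Role: "system", Content: "Execution plan:\n" + plan}
	return append([]Message{planMsg}, msgs...)
}

func main() {
	msgs := injectPlan(
		"1. Identify Ember entity\n2. Apply personality changes",
		[]Message{{Role: "user", Content: "Make Ember fiercer"}},
	)
	fmt.Println(msgs[0].Role) // system
}
```

Keeping the plan as the first message means later context trimming is less likely to drop it.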
Cost: ~$0.003 per request, 3-5s latency. The only fix that uses an LLM call.
Fix 2: PrioritizeContext (pure code)
Sort context items by salience score. Higher-relevance items go first. Low-relevance items dropped when the token budget is exceeded.
When the user says “Make Ember fiercer,” Ember’s data gets priority. Luna’s backstory gets dropped. The LLM sees signal, not noise.
```go
sort.Slice(items, func(i, j int) bool {
	return items[i].Salience > items[j].Salience // highest relevance first
})
if len(items) > maxItems { // cap derived from the token budget
	items = items[:maxItems]
}
```
Cost: Zero. Pure sort + filter.
Fix 3: CompressContext (pure code)
Old conversation rounds get summarized extractively — find tool names, find CONCLUSION markers, truncate the rest. No LLM needed for this level of compression.
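The extractive pass fits in a few lines of Go. This is a sketch: the `TOOL:`/`CONCLUSION:` markers are assumptions about the log format, not the project's actual strings.

```go
package main

import (
	"fmt"
	"strings"
)

// compressRound keeps only the lines that carry signal (tool calls and
// conclusion markers) and drops the free-form reasoning around them.
func compressRound(round string) string {
	var kept []string
	for _, line := range strings.Split(round, "\n") {
		if strings.Contains(line, "TOOL:") || strings.Contains(line, "CONCLUSION:") {
			kept = append(kept, line)
		}
	}
	return strings.Join(kept, "\n")
}

func main() {
	round := "Thinking about Ember's personality...\n" +
		"TOOL: updateCharacter\n" +
		"More musing about fire breath...\n" +
		"CONCLUSION: rename Ember to Infernia"
	fmt.Println(compressRound(round))
}
```

A round of several thousand characters collapses to two lines while keeping exactly what a later round needs to know.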
Cost: Zero. String operations.
Fix 4: Preserve Conclusions (pure code)
When reasoning text is truncated at 4,000 characters, the truncation used to cut wherever it landed. If the LLM decided “I need to rename Ember to Infernia” in round 1 but that conclusion was at character 4,100, round 2 forgot the decision.
Fix: truncateReasoningPreservingConclusions() finds CONCLUSION/DECISION markers and keeps them even when truncating.
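A sketch of what that function might look like, assuming line-level `CONCLUSION:`/`DECISION:` markers (the marker format is an assumption):

```go
package main

import (
	"fmt"
	"strings"
)

// truncateReasoningPreservingConclusions hard-truncates reasoning text at
// limit bytes, then re-appends any conclusion/decision lines that the cut
// would otherwise have discarded.
func truncateReasoningPreservingConclusions(text string, limit int) string {
	var conclusions []string
	for _, line := range strings.Split(text, "\n") {
		if strings.HasPrefix(line, "CONCLUSION:") || strings.HasPrefix(line, "DECISION:") {
			conclusions = append(conclusions, line)
		}
	}
	if len(text) > limit {
		text = text[:limit]
	}
	for _, c := range conclusions {
		if !strings.Contains(text, c) {
			text += "\n" + c
		}
	}
	return text
}

func main() {
	long := strings.Repeat("reasoning... ", 10) + "\nDECISION: rename Ember to Infernia"
	fmt.Println(truncateReasoningPreservingConclusions(long, 50))
}
```

The decision survives even when it sat past the truncation point, so round 2 still knows what round 1 concluded.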
Cost: Zero. String search.
Fix 5: Max Rounds Cap (pure code)
```go
const DefaultMaxRounds = 10

if roundCount > DefaultMaxRounds {
	break
}
```
Previously unlimited. The agent sometimes looped 15+ rounds on a trivial task. Now it stops at 10 and produces its best result.
Cost: Zero. One if-statement.
Fix 6: Structured Tool Errors (pure code)
Before:

```json
{"error": true, "tool_name": "updateCharacter", "message": "This tool is temporarily unavailable."}
```

After:

```json
{"error": true, "tool_name": "updateCharacter",
 "message": "This tool is temporarily unavailable.",
 "error_type": "timeout", "retryable": true}
```
With retryable: true, the LLM knows to try again instead of giving up. With error_type: "timeout", it knows the issue is transient.
Cost: Zero. String classification.
Fix 7: Circuit Breaker (pure code)
Count failures per LLM provider. After 3 consecutive failures, skip that provider and try the fallback. Prevents the agent from burning through 120 seconds of timeout on a dead provider.
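A minimal sketch of that counter-plus-threshold breaker (the type and method names are illustrative; the threshold of 3 matches the text):

```go
package main

import "fmt"

// breaker counts consecutive failures per provider and refuses a provider
// once it crosses the threshold; a success resets the count.
type breaker struct {
	failures  map[string]int
	threshold int
}

func newBreaker() *breaker {
	return &breaker{failures: map[string]int{}, threshold: 3}
}

func (b *breaker) Allow(provider string) bool  { return b.failures[provider] < b.threshold }
func (b *breaker) RecordFailure(provider string) { b.failures[provider]++ }
func (b *breaker) RecordSuccess(provider string) { b.failures[provider] = 0 }

func main() {
	b := newBreaker()
	for i := 0; i < 3; i++ {
		b.RecordFailure("primary")
	}
	fmt.Println(b.Allow("primary"), b.Allow("fallback")) // false true
}
```

Production breakers usually add a cool-down so a tripped provider is retried eventually; this sketch omits that for brevity.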
Cost: Zero. Counter + threshold.
Fix 8: HTTP Client Reuse (pure code)
Store *http.Client on the provider struct, reuse across calls. Previously each call created a new client, a new TCP connection, a new TLS handshake.
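The fix is a single struct field. `http.Client` is safe for concurrent use and its transport keeps TCP/TLS connections alive between requests; the `provider` type below is illustrative:

```go
package main

import (
	"net/http"
	"time"
)

// provider holds one shared *http.Client instead of constructing a new
// client (and a new connection pool) on every call.
type provider struct {
	client *http.Client
}

func newProvider() *provider {
	return &provider{client: &http.Client{Timeout: 30 * time.Second}}
}

// call reuses the stored client, so repeated requests to the same host
// reuse the existing TCP connection and TLS session.
func (p *provider) call(req *http.Request) (*http.Response, error) {
	return p.client.Do(req)
}

func main() {
	_ = newProvider()
}
```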
Cost: Zero. Struct field.
The Results
| Test | Baseline | After Fix | Delta | What Fixed It |
|---|---|---|---|---|
| QL-001 | 0% | 0% | = | Needs pipeline architecture change |
| QL-002 | 80% | 80% | = | Already working |
| QL-003 | 40% | 80% | +40% | PrioritizeContext + PlanExecution |
| QL-004 | 100% | 100% | = | Already working |
| QL-005 | 20% | 80% | +60% | Max rounds + structured errors + conclusion preservation |
Overall: 40% → 60%. Same model. Better input.
Latency went from 26s to 43s due to the PlanExecution LLM call (~3-5s per test). The HTTP reuse and circuit breaker savings show up under concurrent load, not in a 5-test sequential run.
What Didn’t Improve — And Why
QL-001 (multi-step creation) stayed at 0%. This isn’t a context problem — it’s a pipeline architecture problem. Each entity is created in isolation with no shared state:
```
Create dragon → (no state) → Create knight → (no state) → Create relationship
                                                          ↑ doesn't know the dragon's ID
```
Fixing this requires collapsing the 5-layer pipeline into a unified agent with cross-step state — a larger architectural change, not a context fix.
The lesson: Context optimization has a ceiling. Past that ceiling, you need architecture changes. But the ceiling is higher than most people think — we still had 20 points of improvement available before hitting it.
What’s Still Missing
Three pieces of infrastructure were built but not wired:
| Component | Status | Gap |
|---|---|---|
| VerifyOutput | Logs quality issues | Doesn’t retry on failure |
| ScoreMemoryUsage | Computes relevance scores | Scores never applied to future retrieval |
| PlanExecution | Generates plan before loop | Plan not tracked during execution |
All three are open loops: the infrastructure detects problems but doesn’t act on them. Closing these loops is the next 20 points, getting from 60% to 80%+.
The Takeaway
Better input → better output. The LLM is the same.
If your agent is underperforming, check the context before blaming the model. In our case:
- 7 out of 8 fixes were pure code
- The only added LLM cost was the single planning call, at ~$0.003 per request
- A 20-point quality improvement without changing the model
- The model was always capable — the context was holding it back
The highest-ROI investment in any agent system is context management. It’s not glamorous. It’s sort, filter, compress, truncate, prioritize. But it’s the difference between 40% and 60% — and the foundation for everything else.
Part of the AI Agent Architecture series. See also: The 90% Problem for the broader framework, and Claude Code Deep Dive Part 3 for how Anthropic solves context at scale.