Claude Code Deep Dive Part 2: The 1,421-Line While Loop That Runs Everything

Inside query.ts — the 1,729-line async generator that is Claude Code's beating heart. 10 steps per iteration, 9 continue points, 4-stage compression, and streaming tool execution. With line numbers.

April 3, 2026
Harrison Guo
9 min read

This is Part 2 of our Claude Code Architecture Deep Dive series. Part 1: 5 Hidden Features covered the surface-level discoveries. Now we go deeper.

The Heart of Claude Code

Every AI coding agent — Claude Code, Cursor, Copilot — runs some version of the same loop: send context to an LLM, get back text and tool calls, execute tools, feed results back, repeat. We called this LLM talks, program walks.

But Claude Code’s implementation of this loop is anything but simple. It lives in query.ts, a 1,729-line async generator. The while(true) starts at line 307 and ends at line 1728 — a single loop body spanning 1,421 lines of production code.

This is not a toy. This is the engine that processes every keystroke, every tool call, every error recovery, every context compression decision for millions of users.

// query.ts — line 307
// eslint-disable-next-line no-constant-condition
while (true) {
    let { toolUseContext } = state
    const { ... } = state
    // ... 1,421 lines of state machine logic ...
    state = next
} // while (true)  — line 1728

Why a State Machine, Not Recursion

Early versions of Claude Code used recursion — the query function called itself. But recursion has a fatal flaw: in long conversations with hundreds of tool calls, the call stack grows until it explodes.

The current design uses while(true) with a state object that carries context between iterations:

// query.ts — lines 207-215 (State type, partial)
autoCompactTracking: AutoCompactTrackingState | undefined
maxOutputTokensRecoveryCount: number
hasAttemptedReactiveCompact: boolean       // circuit breaker for 413 recovery
stopHookActive: boolean | undefined
turnCount: number
transition: { reason: string } | undefined // why we continued

Each continue statement is a state transition. There are 9 distinct continue points in the code (including lines 950, 1115, 1165, 1220, 1251, 1305, 1316, and 1340), each representing a different reason to run another turn:

  • Next tool call needed
  • Reactive compact triggered after 413
  • Max output tokens recovery
  • Stop hook interrupted
  • Token budget continuation
  • And more
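The pattern behind those continue points can be sketched in a few lines. This is an illustrative reconstruction, not the actual query.ts code — the field names mirror the State type shown above, but the transition reasons and helper are invented:

```typescript
// Hypothetical sketch of the continue-as-state-transition pattern.
// Reasons and the nextTurn helper are illustrative, not real identifiers.
type TransitionReason =
  | "tool_calls_pending"
  | "reactive_compact"
  | "max_output_tokens_recovery"
  | "stop_hook_continue"
  | "token_budget_continue";

interface LoopState {
  turnCount: number;
  transition: { reason: TransitionReason } | undefined;
}

// Each `continue` records *why* the loop is running another turn, so the
// next iteration (and telemetry) can see the transition that caused it.
function nextTurn(state: LoopState, reason: TransitionReason): LoopState {
  return { turnCount: state.turnCount + 1, transition: { reason } };
}
```

The payoff of carrying the reason in state rather than in control flow: any iteration can inspect how it got there, which is exactly what circuit breakers like hasAttemptedReactiveCompact need.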

The Loop at a Glance

flowchart TD
    A["① Compress Context (4 stages)"] --> B["② Token Budget Check"]
    B --> C["③ Call Model API (streaming)"]
    C --> D["④ Stream Tool Execution (parallel with generation)"]
    D --> E["⑤ Error Recovery (413 → reactive compact)"]
    E --> F["⑥ Stop Hooks"]
    F --> G["⑦ Token Budget Check #2"]
    G --> H["⑧ Execute Tools (14-step pipeline)"]
    H --> I["⑨ Inject Attachments (memory, skills, queued cmds)"]
    I --> J["⑩ Assemble Messages"]
    J -->|"next turn"| A
    style A fill:#1a4d2e,stroke:#22c55e,color:#fff
    style C fill:#1a3a5c,stroke:#3b82f6,color:#fff
    style D fill:#4a3520,stroke:#f59e0b,color:#fff
    style E fill:#4a2020,stroke:#ef4444,color:#fff
    style H fill:#3a2050,stroke:#8b5cf6,color:#fff

10 Steps Per Iteration

Each time the loop runs, it does these 10 things in order. Every step has real source code behind it.

Step 1: Context Compression (4 stages)

Before calling the API, the system tries to fit everything into the context window. Four compression mechanisms fire in priority order (imports at lines 12-16, 115-116):

  1. Snip Compact — trims overly long individual messages in history
  2. Micro Compact — finer-grained editing based on tool_use_id, cache-friendly (line 370: “microcompact operates purely by tool_use_id”)
  3. Context Collapse — folds inactive context regions into summaries
  4. Auto Compact — when total tokens approach the threshold, triggers full compression

These are not mutually exclusive — they run in priority order:

flowchart LR
    A["Snip Compact: trim long messages"] -->|"still too big?"| B["Micro Compact: tool_use_id edits"]
    B -->|"still too big?"| C["Context Collapse: fold inactive regions"]
    C -->|"still too big?"| D["Auto Compact: full compression"]
    D -->|"API returns 413"| E["Reactive Compact: emergency, once only"]
    style A fill:#1a3a2e,stroke:#4ade80,color:#fff
    style B fill:#1a3a2e,stroke:#4ade80,color:#fff
    style C fill:#3a3520,stroke:#fbbf24,color:#fff
    style D fill:#4a2020,stroke:#ef4444,color:#fff
    style E fill:#4a1020,stroke:#dc2626,color:#fff

The system tries lightweight options first. If snip + micro bring tokens under the limit, the heavy compressors never run.
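The "lightweight first, heavy only when needed" logic can be sketched as a priority-ordered pipeline. This is a minimal sketch under the assumption that each stage can report a resulting token estimate — the stage names come from the article, the interfaces are invented:

```typescript
// Sketch of a priority-ordered compression pipeline. Each stage is tried
// in order, and the loop stops as soon as the context fits.
interface Compressor {
  name: string;
  compress: (tokens: number) => number; // returns new token estimate
}

function compressToFit(
  tokens: number,
  limit: number,
  stages: Compressor[],
): { tokens: number; applied: string[] } {
  const applied: string[] = [];
  for (const stage of stages) {
    if (tokens <= limit) break; // cheap stages may make expensive ones unnecessary
    tokens = stage.compress(tokens);
    applied.push(stage.name);
  }
  return { tokens, applied };
}
```

If snip alone gets under the limit, micro, collapse, and auto never run — the same short-circuit behavior the article describes.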

Step 2: Token Budget Check

If a token budget is active (feature('TOKEN_BUDGET'), line 280), the system checks whether to continue. Users can specify targets like “+500k”, and the system tracks cumulative output tokens per turn, injecting nudge messages near the goal to keep the model working.
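A parser for targets like "+500k" might look like the following. The real feature('TOKEN_BUDGET') plumbing is not public, so this only sketches the string format the article describes:

```typescript
// Hypothetical parser for budget targets such as "+500k" or "+2m".
// Returns the target in tokens, or undefined if the spec is malformed.
function parseBudgetTarget(spec: string): number | undefined {
  const m = /^\+(\d+(?:\.\d+)?)([km]?)$/i.exec(spec.trim());
  if (!m) return undefined;
  const scale: Record<string, number> = { "": 1, k: 1_000, m: 1_000_000 };
  return Math.round(parseFloat(m[1]) * scale[m[2].toLowerCase()]);
}
```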

Step 3: Call Model API

Line 659 — the actual API call:

for await (const message of deps.callModel({

This is a streaming call. The response arrives token by token, and the system processes it incrementally.
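The incremental-processing shape is the familiar for await over an async generator. A sketch, with callModel and the event shapes as stand-ins for the real deps API:

```typescript
// Sketch of consuming a streaming model response. The event types and
// callModel stub are simplified stand-ins, not the real message shapes.
type StreamEvent =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string };

async function* callModel(): AsyncGenerator<StreamEvent> {
  yield { type: "text", text: "Let me check that file." };
  yield { type: "tool_use", id: "t1", name: "Read" };
}

async function consume(): Promise<string[]> {
  const seen: string[] = [];
  // Each event is processed as it arrives, not after the response completes —
  // this is what makes streaming tool execution (next step) possible.
  for await (const event of callModel()) {
    seen.push(event.type);
  }
  return seen;
}
```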

Step 4: Streaming Tool Execution

This is a critical optimization. Traditional agents wait for the model to finish generating all output, then execute tools. Claude Code uses StreamingToolExecutor (imported at line 96):

When the model is still generating its second tool call, the first one is already running:

Traditional Agent (sequential):
┌─────────────────────────┐┌───┐┌───┐┌───┐┌───┐┌───┐
│  LLM generates 5 calls  ││ T1││ T2││ T3││ T4││ T5│  ← 30s total
└─────────────────────────┘└───┘└───┘└───┘└───┘└───┘

Claude Code (streaming):
┌─────────────────────────┐
│  LLM generates 5 calls  │
├──┬──┬──┬──┬──┬──────────┘
│T1│T2│T3│T4│T5│                                       ← 18s total
└──┴──┴──┴──┴──┘
↑ tools start while LLM is still generating

In a turn with 5 tool calls, traditional waits 30 seconds. Streaming finishes in 18 — a 40% speedup from architecture alone, not model improvements.

Lines 554-555 reveal an interesting detail: stop_reason === 'tool_use' is unreliable — “it’s not always set correctly.” The system detects tool calls by watching for tool_use blocks during streaming instead.
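That detection strategy is easy to sketch: treat the content blocks themselves as ground truth and ignore the advisory stop_reason. The block shape here is a simplified stand-in for the real message types:

```typescript
// Sketch of tool-call detection by inspecting content blocks, rather
// than trusting stop_reason. Block shape is a simplified stand-in.
interface ContentBlock {
  type: "text" | "tool_use" | "thinking";
}

function hasToolCalls(blocks: ContentBlock[]): boolean {
  // The blocks streamed so far are the ground truth; stop_reason is not.
  return blocks.some((b) => b.type === "tool_use");
}
```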

Step 5: Error Recovery

When the prompt is too long, the system escalates: first a context collapse drain, then reactive compact (lines 15-16). If the API returns 413 (prompt too long), it triggers emergency compression and retries the call.

But there’s a circuit breaker: hasAttemptedReactiveCompact (line 209, initialized false at line 275) ensures each turn only attempts reactive compact once. Without this, a genuinely oversized conversation would loop forever.
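The breaker itself is a one-line idea. A sketch — the state field name matches the article, but the surrounding handler is invented:

```typescript
// Sketch of the once-per-turn circuit breaker for reactive compact.
interface RecoveryState {
  hasAttemptedReactiveCompact: boolean;
}

function shouldReactiveCompact(state: RecoveryState, status: number): boolean {
  // Attempt emergency compression at most once per turn: a conversation
  // still oversized *after* compaction must fail instead of looping forever.
  if (status !== 413 || state.hasAttemptedReactiveCompact) return false;
  state.hasAttemptedReactiveCompact = true;
  return true;
}
```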

The system also handles model degradation — if the primary model fails, it can fall back to a different model.

Step 6: Stop Hooks

After the model stops outputting, the system runs registered stop hooks. These can inspect the output and decide whether to let the model continue. This is where external governance plugs in.

Step 7: Token Budget Check (Again)

Yes, checked twice — once before calling the model (should we even start?) and once after (did we exceed the budget?). The second check decides whether to inject a “keep going” nudge or stop.
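A sketch of that post-turn decision, with the 90% nudge threshold invented purely for illustration:

```typescript
// Hypothetical post-turn budget check: keep going, nudge, or stop.
// The 90% threshold is an invented illustration, not the real value.
function budgetDecision(
  usedTokens: number,
  targetTokens: number,
): "continue" | "nudge" | "stop" {
  if (usedTokens >= targetTokens) return "stop"; // budget exhausted
  if (usedTokens >= targetTokens * 0.9) return "nudge"; // near the goal: inject a reminder
  return "continue"; // plenty of budget left; nothing to inject
}
```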

Step 8: Tool Execution

If the response contains tool_use blocks, execute them. Two paths:

  • runTools() (from toolOrchestration.ts, line 98) — batch execution
  • StreamingToolExecutor (line 96) — streaming execution, gated by config.gates.streamingToolExecution (line 561)
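The gate between the two paths might look like this. The gate name matches the article; the executor interfaces are stand-ins:

```typescript
// Sketch of the config gate selecting between batch and streaming
// execution. Executor signatures are invented stand-ins.
interface ToolCall {
  name: string;
}
type Executor = (calls: ToolCall[]) => string;

const runTools: Executor = (calls) => `batch:${calls.length}`;
const streamingExecutor: Executor = (calls) => `stream:${calls.length}`;

function pickExecutor(gates: { streamingToolExecution: boolean }): Executor {
  return gates.streamingToolExecution ? streamingExecutor : runTools;
}
```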

Each tool call goes through the 14-step execution pipeline in toolExecution.ts (1,745 lines) — validation, permission checks, hooks, actual execution, analytics. That’s a story for Part 3.

Step 9: Attachment Injection

After tools finish, the system injects additional context before the next turn:

  • Memory attachments — relevant memories from the memdir/ system
  • Skill discovery — matching skills based on the current task
  • Queued commands — any commands that were waiting

This happens after tool execution but before the next API call, ensuring the model has fresh context.

Step 10: Assemble and Loop

Build the new message list from all the pieces — original conversation, tool results, attachments, system reminders — and go back to step 1.
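The assembly step above can be sketched as a pure function over the pieces named. All shapes here are invented stand-ins:

```typescript
// Sketch of step 10: build the next turn's message list. Shapes invented.
interface Msg {
  role: "user" | "assistant" | "system";
  kind: string;
}

function assembleNextTurn(
  history: Msg[],
  toolResults: Msg[],
  attachments: Msg[],
  reminders: Msg[],
): Msg[] {
  // Order matters: tool results answer the model's calls, while
  // attachments and reminders land last so they are freshest in context.
  return [...history, ...toolResults, ...attachments, ...reminders];
}
```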

Why This Architecture Matters

Most open-source AI agents implement the loop as 50 lines of pseudocode: call model, parse tool calls, execute, repeat. Claude Code’s 1,421-line version exists because production reality is messy:

Context doesn’t fit. A real coding session easily hits 200K tokens. Without the 4-stage compression pipeline, the agent dies on every long conversation. Most agents just truncate and lose context. Claude Code compresses intelligently — lightweight first, heavy only when needed.

Models fail. APIs return 413, connections drop, rate limits hit. The 9 continue points aren’t over-engineering — they’re the minimum number of recovery paths needed for reliable operation. The hasAttemptedReactiveCompact circuit breaker is the kind of detail that separates a demo from a product.

Speed matters more than correctness of execution order. Streaming tool execution — starting the first tool while the model is still generating the third — is a user experience decision backed by architecture. Traditional agents feel slow because they are: they serialize everything. Claude Code parallelizes at the loop level.

Tokens cost money. The SYSTEM_PROMPT_DYNAMIC_BOUNDARY marker in prompts.ts (914 lines) splits the system prompt into static (cacheable) and dynamic sections. If two requests share the same static prefix byte-for-byte, the API caches it. Source comment: “don’t modify content before the boundary, or you’ll destroy the cache.” This is prompt cache economics — saving Anthropic real compute costs at scale.
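The split itself is simple to sketch. The marker name matches the article, but its contents and the helper are invented here — the real boundary value is not shown in the source:

```typescript
// Sketch of splitting a system prompt at a boundary marker so the static
// prefix stays byte-identical across requests (and therefore cacheable).
// The marker's contents are invented for illustration.
const SYSTEM_PROMPT_DYNAMIC_BOUNDARY = "<!-- dynamic-boundary -->";

function splitPrompt(prompt: string): { static: string; dynamic: string } {
  const i = prompt.indexOf(SYSTEM_PROMPT_DYNAMIC_BOUNDARY);
  if (i === -1) return { static: prompt, dynamic: "" };
  // Never touch content before the boundary: any byte change there
  // invalidates the provider-side prompt cache.
  return {
    static: prompt.slice(0, i),
    dynamic: prompt.slice(i + SYSTEM_PROMPT_DYNAMIC_BOUNDARY.length),
  };
}
```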

The Behavioral Constitution

Buried inside the prompt assembly, getSimpleDoingTasksSection() may be the most valuable function in the entire codebase. It encodes hard-won rules about what the model should NOT do:

  • Don’t add features the user didn’t ask for
  • Don’t over-abstract — three duplicate lines beat a premature abstraction
  • Don’t add comments to code you didn’t change
  • Don’t add unnecessary error handling
  • Read code before modifying it
  • If a method fails, diagnose before retrying
  • Report honestly — don’t say you ran something you didn’t

Anyone who has used Claude Code recognizes these rules. I’ve personally watched the system refuse to add “helpful” abstractions and stick to minimal changes. That’s not the model being disciplined — it’s the prompt constraining the model. The takeaway: don’t trust model self-discipline. Codify the behavior.
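Codifying behavior is mechanically trivial — the value is in the rules, not the plumbing. A sketch loosely modeled on getSimpleDoingTasksSection, with the rules paraphrased from the list above:

```typescript
// Sketch of emitting behavioral rules as a prompt section. Loosely
// modeled on getSimpleDoingTasksSection; rules paraphrased, not verbatim.
function doingTasksSection(): string {
  const rules = [
    "Don't add features the user didn't ask for.",
    "Avoid premature abstraction; small duplication is fine.",
    "Don't add comments to code you didn't change.",
    "Read code before modifying it.",
    "Report honestly about what you actually ran.",
  ];
  return rules.map((r) => `- ${r}`).join("\n");
}
```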

How Other Agents Compare

| Aspect | Claude Code | Cursor | Typical OSS Agent |
|---|---|---|---|
| Loop complexity | 1,421 lines, 9 continue points | Unknown (closed source) | ~50-200 lines |
| Compression | 4-stage pipeline + reactive 413 recovery | Tab-level context pruning | Truncate or fail |
| Tool execution | Streaming (parallel with generation) | Sequential | Sequential |
| Error recovery | Circuit breakers, model fallback, emergency compact | Basic retry | Crash |
| Prompt caching | Static/dynamic boundary, section registry | Unknown | None |

The gap between Claude Code and most open-source agents is not model quality — it’s the program layer. The model is the same Opus or Sonnet for everyone. What makes Claude Code feel different is 1,421 lines of careful engineering around it.

The Bottom Line

The query loop is where “LLM talks, program walks” becomes concrete:

  • The LLM outputs text and tool call JSON. That’s it.
  • The program handles compression, budget tracking, error recovery, streaming, permissions, memory injection, and 14-step tool validation.
  • The 1,421 lines are not the model being smart. They’re the program being careful.

If you’re building an AI agent and your main loop is under 100 lines, you’re not handling the cases that matter. Production is not about the happy path. It’s about what happens when context overflows, the API returns 413, the user’s conversation hits 500 turns, and three tools need to run while the model is still thinking.


Next: Part 3 — The 14-Step Tool Execution Pipeline (coming soon) — what happens between “model says call this tool” and the tool actually running.

Previous: Part 1 — 5 Hidden Features Found in 510K Lines

Video: The AI Stack Explained — LLM Talks, Program Walks
