Observability and Billing for AI API Calls: A T-Shaped Architecture

AI API calls are unlike ordinary RPC: per-request cost varies 100×, tokens and models are first-class, streaming muddies timing, and caching changes the pricing. A T-shaped instrumentation architecture — shared stem, specialized arms — handles tracing, billing, and cost analytics without any of them contaminating the others.

April 1, 2026
Harrison Guo
13 min read
AI Engineering System Architecture

Adding AI API calls to an existing backend is where most teams’ observability and billing instincts break. The calls look similar to any other RPC — send a JSON request, receive a JSON response. The difference is what happens to the meter. An ordinary RPC costs you deterministic compute: a few milliseconds of CPU, a few KB of network. An LLM API call costs you between $0.0001 and $1.50 depending on which model, which provider, how long the prompt was, how long the completion went, and whether the provider’s prompt cache kicked in. Same endpoint, same code path, roughly four orders of magnitude of price variance per call.

The first teams I watched work through this problem made the obvious mistake: piggybacking AI cost on the tracing system. Add token counts as attributes to spans, query traces at month-end for invoicing. It’s the same mistake as in the dual-path architecture post, but with a uniquely AI-shaped twist: because the per-call cost variance is so wide, sampled tracing is even more wrong for billing than usual.

The cleaner architecture is T-shaped. One shared instrumentation stem that captures every LLM call’s cost basis. Three specialized arms branching off — tracing, billing, analytics — each optimized for its job, each independent. Let me walk through why each piece is there.

tl;dr — AI API calls break the normal observability-vs-billing split because per-call cost variance is 100×+ and the cost is driven by values (tokens, model, provider, cache hit/miss) that aren’t in a normal transport trace. A T-shaped architecture gives you: (1) one shared instrumentation layer that captures the cost basis on every call, (2) a trace arm for debugging, (3) a billing/metering arm for quotas and invoicing, (4) a cost-analytics arm for feature-level ROI. The arms are independent. Each gets the right durability, cardinality, and retention for its job. Compressing this into one system leaves at least one of the three degraded.


Why AI API Calls Are Different

Ordinary backend services have costs that scale roughly with traffic: more requests, more CPU. You can amortize billing to a flat per-request cost or a time-windowed aggregate. LLM API calls break that assumption in four specific ways:

1. Per-call cost variance is huge. A gpt-3.5-turbo call with a 100-token prompt and a 50-token completion costs about $0.0003. A gpt-4o-mini call with an 8K prompt and a 2K response costs about $0.002. A claude-3.5-sonnet call with 20K context and 4K output costs about $0.09. A long reasoning run on o1 can run to $2+ for a single call. Same pattern (one API call from your backend to the provider), 7000× cost spread.

2. The cost basis isn’t at the transport layer. HTTP status code, request size, response size — none of these tell you what the call cost. You need input tokens, output tokens (sometimes cached tokens, sometimes reasoning tokens), model name, and provider. These live inside the request/response payload.

3. Streaming changes accounting. A streaming response arrives in chunks. The provider’s final event usually includes the definitive token counts, but if the stream is cancelled mid-flight (user navigated away, your backend hit a timeout), you’ve paid for partial output and your instrumentation has to capture it.

4. Caching bends pricing. OpenAI’s prompt caching, Anthropic’s prompt caching, and most self-hosted solutions charge differently for cached-input tokens (often a 50–90% discount). A system that doesn’t distinguish cached from uncached tokens will systematically over-bill features whose calls hit the cache often.

Any billing or attribution system that doesn’t see all four of these cleanly is going to produce numbers that don’t match the invoice.
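To make the caching point concrete, here is a minimal sketch of a cost calculator that prices a call from its raw token counts, with cached input and reasoning tokens handled explicitly. The Rates struct, the rate table, and every number in it are illustrative placeholders, not a real price sheet — provider pricing changes often and belongs in config.

```go
package main

// Illustrative per-million-token rates. These numbers are placeholders,
// not a real price sheet; keep real rates in config, since they change.
type Rates struct {
	InputPerM    float64 // uncached input tokens
	CachedInPerM float64 // cached input tokens (discounted)
	OutputPerM   float64 // output tokens; reasoning tokens usually bill at this rate
}

var rateTable = map[string]Rates{
	"gpt-4o-mini":       {InputPerM: 0.15, CachedInPerM: 0.075, OutputPerM: 0.60},
	"claude-3-5-sonnet": {InputPerM: 3.00, CachedInPerM: 0.30, OutputPerM: 15.00},
}

// CalculateCost prices a call from its raw token counts. Cached input
// tokens get the discounted rate; reasoning tokens bill as output.
func CalculateCost(model string, inputTokens, cachedTokens, outputTokens, reasoningTokens int) float64 {
	r, ok := rateTable[model]
	if !ok {
		return 0 // unknown model: flag it in reconciliation rather than guess
	}
	uncached := float64(inputTokens - cachedTokens)
	return uncached*r.InputPerM/1e6 +
		float64(cachedTokens)*r.CachedInPerM/1e6 +
		float64(outputTokens+reasoningTokens)*r.OutputPerM/1e6
}
```

Storing the raw counts alongside the computed dollar figure means the warehouse can be re-priced later if a rate in the table turns out to be wrong.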

The T-Shape

Here’s the architecture I’d build today:

flowchart TB
    App["Application code<br/>every LLM call site"] --> Stem
    subgraph Stem["Shared Instrumentation Stem"]
        direction TB
        Wrap["Wrap every LLM call · capture cost basis"]
        Fields["model · provider<br/>input_tokens · cached_tokens · output_tokens<br/>reasoning_tokens · cost_usd · latency_ms<br/>feature_tag · user_id · trace_id · cache_hit"]
        Wrap --> Fields
    end
    Stem --> Trace
    Stem --> Bill
    Bill --> Analytics
    subgraph Trace["Arm 1 · Tracing"]
        T1["OTel spans · sampled 10–30%"]
        T2["Days to weeks retention"]
        T3["For engineering debug"]
    end
    subgraph Bill["Arm 2 · Billing / Metering"]
        B1["Every call · unsampled"]
        B2["Durable queue → columnar warehouse"]
        B3["Long retention · strong schema"]
        B4["Quotas · invoicing · usage"]
    end
    subgraph Analytics["Arm 3 · Cost Analytics"]
        A1["Materialized views on billing data"]
        A2["Per-feature · per-cohort · per-model"]
        A3["Routing · caching · model-tier decisions"]
    end
    classDef stem fill:#e8f4f8,stroke:#2c5282,stroke-width:2px
    classDef trace fill:#fef5e7,stroke:#b7791f
    classDef bill fill:#f0fff4,stroke:#2f855a,stroke-width:2px
    classDef analytics fill:#faf5ff,stroke:#6b46c1
    class Stem stem
    class Trace trace
    class Bill bill
    class Analytics analytics

Three arms, all fed from one instrumentation stem. The arms branch early and stay independent: each one’s durability, sampling, cardinality, and retention are tuned for its job, and none of them contaminate the others. Let me walk through the stem first, then each arm.

The Shared Stem: Instrumentation Wrapper

The stem is a wrapper around every LLM API call your application makes. In Go, something like:

type LLMClient interface {
    Call(ctx context.Context, req LLMRequest) (LLMResponse, error)
}

type InstrumentedClient struct {
    inner LLMClient
    emit  func(UsageEvent) // fire-and-forget to both arms
}

type UsageEvent struct {
    EventID          string    `json:"event_id"`
    OccurredAt       time.Time `json:"occurred_at"`
    AccountID        string    `json:"account_id"`
    UserID           string    `json:"user_id"`
    TraceID          string    `json:"trace_id"`
    RequestID        string    `json:"request_id"`
    FeatureTag       string    `json:"feature_tag"` // "summarize", "translate", "agent.plan", etc.
    Provider         string    `json:"provider"`     // "openai", "anthropic", "local"
    Model            string    `json:"model"`        // "gpt-4o-mini", "claude-3-5-sonnet", ...
    InputTokens      int       `json:"input_tokens"`
    CachedTokens     int       `json:"cached_tokens"`
    OutputTokens     int       `json:"output_tokens"`
    ReasoningTokens  int       `json:"reasoning_tokens,omitempty"` // o1-family
    CacheHit         bool      `json:"cache_hit"`
    CostUSD          float64   `json:"cost_usd"`
    LatencyMs        int64     `json:"latency_ms"`
    Streaming        bool      `json:"streaming"`
    StreamCompleted  bool      `json:"stream_completed"`
    StatusCode       int       `json:"status_code"`
    ErrorCode        string    `json:"error_code,omitempty"`
    IdempotencyKey   string    `json:"idempotency_key"`
}

func (c *InstrumentedClient) Call(ctx context.Context, req LLMRequest) (LLMResponse, error) {
    start := time.Now()
    span := trace.SpanFromContext(ctx) // OTel: go.opentelemetry.io/otel/trace

    resp, err := c.inner.Call(ctx, req) // on error, resp is zero-valued; the event still records the failure

    event := UsageEvent{
        EventID:         uuid.New().String(),
        OccurredAt:      start,
        AccountID:       AccountFromCtx(ctx),
        UserID:          UserFromCtx(ctx),
        TraceID:         span.SpanContext().TraceID().String(),
        RequestID:       RequestIDFromCtx(ctx),
        FeatureTag:      FeatureFromCtx(ctx),
        Provider:        req.Provider,
        Model:           req.Model,
        InputTokens:     resp.Usage.InputTokens,
        CachedTokens:    resp.Usage.CachedTokens,
        OutputTokens:    resp.Usage.OutputTokens,
        ReasoningTokens: resp.Usage.ReasoningTokens,
        CacheHit:        resp.Usage.CachedTokens > 0,
        CostUSD:         calculateCost(req.Model, req.Provider, resp.Usage),
        LatencyMs:       time.Since(start).Milliseconds(),
        Streaming:       req.Streaming,
        StreamCompleted: resp.StreamCompleted,
        StatusCode:      resp.StatusCode,
        ErrorCode:       errorCodeOf(err),
        IdempotencyKey:  req.IdempotencyKey,
    }

    c.emit(event) // to both arms, fire-and-forget with local buffering
    return resp, err
}

Three properties of the stem that matter:

1. It’s per-call and unsampled. Every call emits one event. No head sampling. If your LLM volume is so high that emitting unsampled events is expensive, the spend those events are metering is large enough to justify the emission cost.

2. It captures the cost basis, not just cost. Store the raw token counts and model, not just cost_usd. Prices change. Models get added. Discount tiers appear. You want to be able to re-price historical usage if needed — which you can only do if you kept the raw components.

3. It includes feature_tag. The single most valuable dimension for cost analytics. Without it, you know “we spent $40k on LLM calls last month.” With it, you know “$18k on summarization, $8k on agent planning, $6k on translations, $8k on misc.” That second view is what drives optimization decisions.

The wrapper is thin, and every LLM call in the codebase goes through it. Make it an interface: a real client in production, a recording mock in tests. The stem is one piece of code; get it right once and it benefits every arm forever.

Arm 1: Tracing

The tracing arm is unchanged from the dual-path architecture setup. OTel spans with the usage event attached as attributes. Sampled at 10-30% for cost. Retained for days to weeks. Queried by engineers debugging slow/failed LLM calls.

Why is this arm even here if we have the billing arm? Because the questions are different:

  • Tracing questions: “Why was this specific user’s call slow? Which provider was hit? Did it retry? What was the full prompt?”
  • Billing questions: “How many tokens did account X use last month?”
  • Analytics questions: “Which feature’s cost per MAU grew 30% QoQ?”

The tracing arm carries the full prompt/response (maybe redacted), the full span context, and the detail needed for debugging. The billing arm carries only the cost basis. The analytics arm carries aggregates of the billing arm.

You could, in principle, query the billing arm for “show me the usage event for request 123” and answer a debugging question. You can’t query the billing arm for “show me the full prompt and response.” Different optimizations, different data shapes.
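As a sketch of the trace arm’s data shape, here is the usage event flattened into span attributes. The llm.* key names are a naming choice, not a standard, and with the real OTel SDK you would attach these via span.SetAttributes rather than return a map — plain strings keep the sketch dependency-free.

```go
package main

import "strconv"

// The subset of UsageEvent fields this sketch attaches to a span.
type UsageEvent struct {
	Model, Provider, FeatureTag             string
	InputTokens, OutputTokens, CachedTokens int
	CostUSD                                 float64
}

// SpanAttributes flattens the usage event into flat key/value pairs, the
// shape OTel span attributes take. The llm.* keys are our own convention.
func SpanAttributes(e UsageEvent) map[string]string {
	return map[string]string{
		"llm.provider":      e.Provider,
		"llm.model":         e.Model,
		"llm.feature":       e.FeatureTag,
		"llm.input_tokens":  strconv.Itoa(e.InputTokens),
		"llm.cached_tokens": strconv.Itoa(e.CachedTokens),
		"llm.output_tokens": strconv.Itoa(e.OutputTokens),
		"llm.cost_usd":      strconv.FormatFloat(e.CostUSD, 'f', 6, 64),
	}
}
```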

Arm 2: Billing and Metering

The billing arm is where the cost-attribution work happens. It’s more specialized than a generic billing pipeline because AI usage has specific shapes.

Ingest: the stem emits to a durable queue (Kafka, NATS JetStream, AWS Kinesis). Replication factor 3. No sampling. Every call emits exactly one event.
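A minimal sketch of the fire-and-forget emission, assuming a one-method Publisher interface standing in for the real Kafka/NATS/Kinesis producer client:

```go
package main

import (
	"sync"
	"sync/atomic"
)

// UsageEvent trimmed to what the sketch needs.
type UsageEvent struct {
	EventID string
	CostUSD float64
}

// Publisher stands in for a real producer client (Kafka, NATS JetStream,
// Kinesis); only this one-method shape matters here.
type Publisher interface {
	Publish(e UsageEvent) error
}

// MemPublisher is an in-memory Publisher for tests.
type MemPublisher struct {
	mu     sync.Mutex
	Events []UsageEvent
}

func (p *MemPublisher) Publish(e UsageEvent) error {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.Events = append(p.Events, e)
	return nil
}

// BufferedEmitter decouples the request path from the queue: Emit never
// blocks the caller, and a background goroutine drains the buffer. When the
// buffer is full we drop and count rather than stall user traffic; the
// billing arm's reconciliation job catches sustained loss.
type BufferedEmitter struct {
	ch      chan UsageEvent
	done    chan struct{}
	Dropped atomic.Int64
}

func NewBufferedEmitter(pub Publisher, size int) *BufferedEmitter {
	e := &BufferedEmitter{ch: make(chan UsageEvent, size), done: make(chan struct{})}
	go func() {
		defer close(e.done)
		for ev := range e.ch {
			_ = pub.Publish(ev) // real code: retry with backoff, then dead-letter
		}
	}()
	return e
}

func (e *BufferedEmitter) Emit(ev UsageEvent) {
	select {
	case e.ch <- ev:
	default:
		e.Dropped.Add(1)
	}
}

// Close flushes the buffer and waits for the drain goroutine to finish.
func (e *BufferedEmitter) Close() {
	close(e.ch)
	<-e.done
}
```

Dropping on a full buffer is a deliberate choice: billing correctness is recovered by reconciliation, but a stalled request path is immediately user-visible.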

Storage: a columnar warehouse partitioned by date and account. BigQuery, Snowflake, or ClickHouse (managed or self-hosted) all work well. Partitioning by date lets you tier old data to object storage cheaply.

Aggregation patterns:

  • Per-user, per-day, per-feature: SUM(cost_usd) GROUP BY user_id, date, feature_tag.
  • Per-account real-time usage: a Redis hash keyed by account, updated from the stream with a small lag. This feeds quota enforcement.
  • Per-feature monthly totals: materialized views refreshed hourly.
  • Provider split: SUM(cost_usd) GROUP BY provider, model — for vendor renegotiation and routing decisions.

Quota enforcement: the real-time Redis hash is read by the application layer before issuing a new LLM call. If the user’s current-period usage + projected cost of the new call exceeds their quota, you return 429 Too Many Requests or a feature-specific error. Cache locally with a few-second TTL to avoid hot-spotting Redis.

func (s *Service) CheckQuotaAndCall(ctx context.Context, req LLMRequest) (LLMResponse, error) {
    account := AccountFromCtx(ctx)
    current := s.usageCache.Get(account.ID)
    estimated := estimateCost(req) // worst-case upper bound

    if current + estimated > account.Quota {
        return LLMResponse{}, ErrQuotaExceeded
    }

    return s.llmClient.Call(ctx, req)
}

Two subtleties:

  • estimateCost is a worst-case upper bound (the known input cost plus max output tokens × output price), not expected cost. Otherwise users can game the system by making many small calls that each “fit” until the actual usage blows the budget.
  • After the call completes, update the real-time cache with the actual cost, so current tracks cumulative usage with only milliseconds of lag.
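A sketch of the worst-case estimator under those rules. The rate numbers are placeholders, and a real version would share the billing arm’s price table rather than carry its own:

```go
package main

// LLMRequest trimmed to what the estimator needs.
type LLMRequest struct {
	Model        string
	PromptTokens int // known up front (tokenize or approximate the prompt)
	MaxTokens    int // the requested completion cap
}

// Placeholder per-million-token rates; share the billing price table in practice.
var estRates = map[string]struct{ InPerM, OutPerM float64 }{
	"gpt-4o-mini": {InPerM: 0.15, OutPerM: 0.60},
}

// EstimateCost returns a worst-case upper bound: the full prompt at the
// uncached input rate plus the entire MaxTokens budget at the output rate.
// Using expected cost instead would let many small calls each "fit" under
// quota while the actual total blows the budget.
func EstimateCost(req LLMRequest) float64 {
	r, ok := estRates[req.Model]
	if !ok {
		return 0 // unknown model: deny or use a conservative default in practice
	}
	return float64(req.PromptTokens)*r.InPerM/1e6 + float64(req.MaxTokens)*r.OutPerM/1e6
}
```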

Reconciliation: a daily job compares per-account event counts in the warehouse against counts emitted at the stem and, where available, against the provider’s own usage reports. Drift indicates missing events (usually a broken emitter) and pages someone.

Arm 3: Cost Analytics

The analytics arm is where platform and product teams derive insight from the billing data. It’s a set of queries, dashboards, and sometimes pre-computed materialized views on top of the billing warehouse.

The queries that actually drive decisions:

Feature cost-to-value ratio: cost per MAU per feature. If summarization costs $0.30 per MAU and translation costs $0.05, and they drive similar engagement, translation is more efficient. Either summarization gets optimized (smaller model, tighter prompts, more caching) or the product-side value of summarization needs to justify the cost.

Model migration analysis: when a new model ships (say gpt-4o-mini arrives and is substantially cheaper than gpt-4), you want to know which features would benefit from migration. SELECT feature_tag, model, COUNT(*), SUM(cost_usd) FROM usage_events GROUP BY feature_tag, model tells you where gpt-4 is still running and approximately what a migration would save.

Prompt-caching ROI: for features where you’ve added prompt caching, query the cache hit rate and effective discount. If cache hit rate is < 40%, caching isn’t paying off and the cache-key logic may be too strict.
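In production this is a warehouse query over the billing data; as an in-memory sketch of the same group-by, assuming the token fields from the stem’s UsageEvent:

```go
package main

// UsageEvent trimmed to the fields the cache-ROI query reads.
type UsageEvent struct {
	FeatureTag   string
	InputTokens  int
	CachedTokens int
}

// CacheStats accumulates per-feature token totals.
type CacheStats struct {
	Input, Cached int
}

// HitRate is the share of input tokens served from the prompt cache.
func (s CacheStats) HitRate() float64 {
	if s.Input == 0 {
		return 0
	}
	return float64(s.Cached) / float64(s.Input)
}

// CacheHitRates is the in-memory equivalent of
// SELECT feature_tag, SUM(cached_tokens), SUM(input_tokens) ... GROUP BY feature_tag.
func CacheHitRates(events []UsageEvent) map[string]CacheStats {
	out := map[string]CacheStats{}
	for _, e := range events {
		s := out[e.FeatureTag]
		s.Input += e.InputTokens
		s.Cached += e.CachedTokens
		out[e.FeatureTag] = s
	}
	return out
}
```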

User cohort analysis: which users drive disproportionate spend? Usually a small cohort of power users generates 50%+ of cost. Useful for pricing-tier design and abuse detection.

Provider performance comparison: for routing decisions, compare latency and error rate across providers at comparable model tiers. If provider A has 99.5% success and provider B has 98.0%, the 1.5% difference is a production quality issue even if pricing is identical.

These queries don’t need to be real-time. Hourly or daily updates are fine. What matters is that the data is there, correctly attributed, and queryable — which is what the billing arm’s schema enables.

The Interesting Corner Cases

A few scenarios that hit every team and deserve explicit attention.

Streaming cancellations. A user starts a chat, abandons after 3 seconds while a 30-second response is streaming. You’ve paid for the partial output. Your stem needs to emit an event with the actual tokens produced, not the hoped-for final. The provider’s stream usually ends with a [DONE] or equivalent event containing final usage; if the stream ends mid-flight, you use whatever tokens arrived.

Retries. Your application retries on 5xx. Each retry is a separate API call, each one costs. The stem emits one event per actual call, not per logical intent. If a user’s “send message” action retries 3 times and the third succeeds, that’s 3 billable events (usually). Your billing arm should show 3 events, not 1; your trace arm should show the whole retry as one parent span with 3 child calls. Don’t deduplicate at the billing arm.

Tool calls / function calls. An agent makes a tool call, gets a result, makes another LLM call with the result in context. That’s two LLM calls. Both get events. The agent’s overall task might cost more than the sum of simple completions because each intermediate call re-sends context. Surface this in analytics — “feature=agent.run, tool_calls=5, total_cost=$0.14” — so product can reason about agent efficiency.

Fine-tuning and embeddings. Training a fine-tune is a multi-hour one-shot job with a single invoice. Doesn’t fit the per-call event pattern. Either emit a large single event with appropriate schema fields (event_type: "fine_tune"), or handle it as a separate out-of-band flow. I prefer a single large event — keeps one source of truth for cost.

Similarly, embeddings calls are cheap per call but can come in high volume. The stem works fine, but your warehouse partitioning needs to handle the volume (maybe separate table for embeddings, same schema).

Multi-turn caching. Anthropic’s and OpenAI’s prompt caching reuse the prefix of previous prompts. Your stem should capture cached_tokens as a distinct field. The cost calculator applies the discounted rate. Analytics can answer “our cache hit rate is 65% across agent-planning calls, saving approximately $X/month.”

Quota Enforcement vs Cost Overrun

One last architectural pattern worth calling out. There are two related but distinct jobs:

  • Quota enforcement: prevent users from spending more than their allocated budget.
  • Cost overrun protection: prevent runaway usage from a bug or attack.

The billing arm handles quotas. Runaway protection is a separate layer — rate limits at the API gateway, per-feature budget caps enforced at the application level, and alerts tied to anomaly detection on the cost analytics arm.

Why separate? Because quota enforcement runs in the per-request path and has to be fast. Runaway protection is about catching patterns over windows (a user making 10k calls in an hour is almost certainly a bug or abuse). Combining them creates a system that’s too slow for per-request checks and too coarse for windowed detection.

Why the Shape Matters More Than the Tools

The instinct to piggyback AI billing on existing observability is understandable. It’s also expensive — both in bad numbers and in retrofitting pain when you eventually split the systems.

The T-shape is boring infrastructure and it works. One instrumentation layer. Three specialized arms. Each arm optimized for its job. Total engineering effort: maybe a week of design, another week of implementation, plus ongoing maintenance of the schema as you add models and providers. Compared to the alternative — a patched tracing pipeline that finance doesn’t trust and platform teams can’t query — it’s cheap.

The bigger shift in thinking is that AI API calls are a different shape of backend operation. They’re not RPC-with-a-higher-dollar-amount. Token counts, model tiers, provider variance, cache hit rates, streaming semantics — these are first-class in the cost model, so they have to be first-class in the instrumentation. Once the stem captures them correctly, the three arms above are tactical. It’s the stem that makes or breaks the system.

