Observability and Cost Attribution: Why One Pipeline Isn't Enough
Tracing systems optimize for signal-to-noise. Billing systems optimize for accuracy and auditability. They look similar, but they are not the same thing. A dual-path architecture for running both without cross-contamination.
A team I worked with tried to build their billing system on top of their tracing pipeline. The idea was clean: every operation already generates a span; spans already have duration and attributes; adding user_id and billable_units to each span lets finance query the trace store to compute invoices. One pipeline, less infrastructure. Beautiful.
Six weeks before the first billing cycle, the wheels came off. The tracing system was sampling at 10% because full capture was too expensive. The sampler was head-based, meaning whether a trace got kept was decided at request entry, long before the code knew whether the request was billable. Some users got charged for 10% of their actual usage; others got free service. Nobody’s invoice agreed with anybody’s usage report.
The workaround — “don’t sample billable traces” — sounded reasonable, broke the tracing pipeline’s cost model immediately, and created a dozen new edge cases around which requests counted as “billable.” Within a month the team was reluctantly building a second pipeline for billing. They still had the first one for traces. Now they had two pipelines that disagreed with each other.
The postmortem landed on a single sentence: observability and cost attribution aren’t the same problem, and pretending they are is expensive twice.
tl;dr — Tracing and metrics optimize for signal-to-noise — you want the interesting outliers, sampling is OK, dropping data is tolerable. Billing optimizes for completeness and auditability — every event must be captured and durably recorded, end of story. The two pipelines have opposite trade-offs on sampling, retention, schema evolution, and cost. Building them as one pipeline forces one of the two to lose. Build them as two, share primitives where possible, let each specialize where it must.
Why They Look Alike
Observability pipelines and billing pipelines do look eerily similar from a distance:
- Both capture events from production systems.
- Both attach metadata to those events.
- Both aggregate events over time windows.
- Both export to a query layer.
It’s tempting — especially to engineers who like clean architecture — to say these are the same problem and build one system. The similarity is surface-level. The constraints are opposite.
| Axis | Observability | Billing |
|---|---|---|
| Loss tolerance | High (sampling is fine) | Zero |
| Latency tolerance | Seconds to minutes | Minutes to hours |
| Retention | Days to weeks | Years |
| Schema evolution | Fast, frequent | Slow, with audit trail |
| Cardinality profile | Low cardinality on hot dims | Arbitrary (per user, per resource) |
| Consumers | SRE, engineering, on-call | Finance, legal, customer |
| Failure mode | Blind spot in a dashboard | Wrong invoice, legal exposure |
The one that really matters: loss tolerance. Everything else follows from it.
A tracing pipeline that drops 10% of spans is fine. You still see the outliers. You still find the slow paths. The system does its job.
A billing pipeline that drops 10% of events is a disaster. Some users underpay. Some users overpay. Finance reconciliation fails. You end up manually auditing transactions for weeks.
The moment one pipeline has to satisfy zero-loss and the other can tolerate 90% sampling, you have two different systems whether you wanted one or two.
The Dual-Path Architecture
The design I keep reaching back to is straightforward: two pipelines, shared ingest, separate durability and query paths.
```mermaid
flowchart LR
    App[Application code] -->|OTLP spans| TraceCol
    App -->|structured usage events, unsampled| UsageQ
    subgraph TracePath["Tracing path — loss-tolerant, fast"]
        TraceCol["Tracing collector"] --> Sampler["Sampler<br/>head or tail · ~10%"]
        Sampler --> HotStore["Hot trace store<br/>Tempo / Jaeger<br/>days retention"]
    end
    subgraph BillingPath["Billing path — zero-loss, auditable"]
        UsageQ["Durable queue<br/>Kafka / NATS JetStream<br/>WAL-durable"] --> Warehouse["Columnar warehouse<br/>BigQuery / Snowflake / ClickHouse<br/>years retention"]
    end
    classDef trace fill:#fef5e7,stroke:#b7791f
    classDef bill fill:#f0fff4,stroke:#2f855a,stroke-width:2px
    class TracePath trace
    class BillingPath bill
```
Two emission paths from the application. Two pipelines behind them. Each tuned for its job.
The tracing path
Stays conventional. OpenTelemetry SDK emits spans. Collector applies head-based or tail-based sampling. Hot store (Tempo, Jaeger, Grafana Cloud) gets 10-20% of the volume. Retention a few days to a few weeks. Query layer is for engineers debugging incidents.
What I optimize for here:
- Cost per span — you’re keeping billions; every byte matters.
- Query latency — on-call wants answers in seconds.
- Auto-instrumentation coverage — the fewer things you have to manually instrument, the better.
What I don’t care about:
- Full capture. Sampling is fine.
- Long retention. You’re debugging last Tuesday, not last fiscal year.
- Per-user accuracy. If a single user’s trace got dropped, nobody cares.
The usage-event path
The dedicated billing pipeline. Every billable operation emits a usage event — a small, structured record with everything finance needs and nothing it doesn’t.
```json
{
  "event_id": "ue_01HFNGR...",
  "occurred_at": "2026-02-14T18:22:30.145Z",
  "account_id": "acc_12345",
  "resource_id": "res_6789",
  "operation": "api.request",
  "dimensions": {
    "region": "us-east-1",
    "tier": "standard"
  },
  "units": {
    "requests": 1,
    "cpu_ms": 147,
    "egress_bytes": 8342
  },
  "idempotency_key": "req_abc_20260214182230"
}
```
The rules on this path:
- Unsampled. Every billable operation emits exactly one event. No head sampling. No tail sampling. No “approximate.”
- Durable writes. Emitter has a local write-ahead log or durable queue. If the downstream is down, events buffer locally until delivery. No dropped events under partial failure.
- Idempotency keys. Every event has a unique ID (or composite key) so downstream dedup is trivial. This lets you retry safely.
- Schema versioned and immutable. Once an event shape is shipped, it doesn’t mutate. New fields add a new version. Old versions keep working until you intentionally deprecate.
- Long retention. Years, usually. Auditors ask for 2023’s data in 2027.
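The versioning rule lends itself to mechanical enforcement. A minimal sketch of additive-only schema validation — the `schema_version` discriminator and the per-version field sets are illustrative assumptions, not part of the event shape above:

```python
# Each shipped event shape is frozen; evolution means a new version
# whose required fields are a superset of the old version's.
USAGE_EVENT_SCHEMAS = {
    1: {"event_id", "occurred_at", "account_id", "operation", "units"},
    # v2 adds fields; v1 events keep validating against v1 forever.
    2: {"event_id", "occurred_at", "account_id", "operation", "units",
        "resource_id", "dimensions"},
}

def validate(event: dict) -> bool:
    """An event must carry every field of the version it declares."""
    required = USAGE_EVENT_SCHEMAS[event["schema_version"]]
    return required <= event.keys()
```

Old emitters keep shipping v1 events untouched; the deprecation of a version is an explicit decision, not a side effect of a field rename.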
The downstream infrastructure matches: Kafka or NATS JetStream with high replication factor for ingest, columnar warehouse (BigQuery, Snowflake, ClickHouse) for aggregation and query, separate auth and access control from engineering-facing tools.
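A sketch of what the emitter side can look like under those rules. `UsageEmitter` and its in-memory buffer are illustrative stand-ins — a real implementation would use an on-disk WAL and a Kafka or JetStream producer as the transport:

```python
import json
import time
import uuid
from collections import deque

class UsageEmitter:
    """Sketch of a zero-loss usage-event emitter.

    Events are appended to a local buffer (standing in for an on-disk
    write-ahead log) before any network send, so a downstream outage
    never drops an event -- it only delays delivery.
    """

    def __init__(self, transport):
        self.transport = transport   # e.g. a durable-queue producer (assumed)
        self.wal = deque()           # stand-in for the on-disk WAL

    def emit(self, account_id: str, operation: str, units: dict):
        event = {
            "event_id": f"ue_{uuid.uuid4().hex}",  # idempotency key
            "occurred_at": time.time(),
            "account_id": account_id,
            "operation": operation,
            "units": units,
        }
        self.wal.append(event)   # buffer first...
        self.flush()             # ...then best-effort delivery

    def flush(self):
        while self.wal:
            event = self.wal[0]
            try:
                self.transport(json.dumps(event))
            except ConnectionError:
                return           # keep buffering; retry on the next flush
            self.wal.popleft()   # remove only after confirmed delivery
```

The ordering is the point: durability happens before the network call, and removal from the buffer happens only after the call succeeds, so a retry can at worst produce a duplicate — which the idempotency key makes harmless downstream.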
What the two paths share
Not nothing. They share:
- The trace/request ID. Usage events include the trace ID of the request that generated them. This is the one cross-pipeline link that matters — when finance escalates “this user says they were charged for X requests but they swear they only made Y,” you want to be able to find the traces of those Y requests.
- OpenTelemetry as the emission library. OTel can emit both spans and custom events. Using it for both keeps the instrumentation codepaths uniform. But the pipelines behind the emitter are different.
- The application’s definition of an “operation.” Both pipelines have opinions about what counts as one operation. Keep that definition single-source.
Why Head-Sampling Kills Billing
Worth dwelling on the specific thing that breaks when you try to unify.
Head-based sampling decides whether to record a trace at entry, based on trace ID. It’s O(1), stateless, and fair across traffic shapes — the standard default.
The failure: at entry time, the system has no idea whether this request will be billable. The sampler doesn’t know if the user is on a paid plan, if the request will succeed, if it will hit a billable feature. It just picks randomly.
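To make that concrete, here is a head-based sampler in miniature — a sketch, not any particular SDK's implementation. The keep/drop decision is a pure function of the trace ID, so nothing about the user's plan or the request's outcome can influence it:

```python
import hashlib

SAMPLE_RATE = 0.10  # keep ~10% of traces

def head_sample(trace_id: str) -> bool:
    """Decide at request entry whether to record this trace.

    The only input is the trace ID -- the sampler cannot see whether
    the request is billable, because that isn't known yet.
    """
    # Hash the trace ID into [0, 1) and compare against the rate.
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < SAMPLE_RATE

# Roughly 10% of traces survive -- billable and non-billable alike.
kept = sum(head_sample(f"trace-{i}") for i in range(100_000))
```

Hashing rather than `random()` makes the decision deterministic per trace ID, which is what lets every service in a distributed call chain agree on the same keep/drop verdict — and it is also exactly why the verdict can never depend on billability.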
Tail-based sampling fixes part of this — you decide after the fact, based on span attributes. Now you can keep all errors, all slow requests, all requests from paid users. Better, but still subject to buffering limits. Heavy tail-based samplers sit in front of your trace ingest pipeline and drop spans when buffers fill, which still gives you lossy billing during traffic bursts.
The only sampler that’s correct for billing is “capture everything.” And “capture everything” is what the tracing pipeline tries to avoid, because that’s what makes it expensive.
You can do “capture everything for billable operations, sample everything else” in one pipeline. It works. It also ends up being the most complex sampler you’ve ever written, with an exception branch that duplicates the decision logic from your actual billing code. The dedicated usage-event path is simpler.
Cardinality and the Per-User Problem
A related anti-pattern: attaching user ID as a Prometheus label.
Prometheus (and most metrics systems) stores one time series per label combination. Add a user_id label to a metric that ten thousand users hit, and you just created ten thousand time series. Add a request_type label alongside, and that’s ten thousand × request-type-count. Cardinality explodes. Your metrics storage bill goes with it.
The instinct is fine — “I want to track per-user throughput” — but the mechanism is wrong. A high-cardinality label on a metric is a square peg in a round hole; a usage event with account_id as a dimension is the fit that works. Emit the event, then aggregate per-user in the warehouse at query time.
Rule I use: metrics for engineering-facing dashboards, events for business-facing attribution. If the label cardinality could exceed ~1,000 distinct values, it belongs in an event, not a label.
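The arithmetic behind that rule is worth making explicit — a short sketch, with the numbers mirroring the example above:

```python
CARDINALITY_LIMIT = 1_000  # rule of thumb from the text

def series_count(label_cardinalities: list[int]) -> int:
    """Time series created by one metric = the product of its
    label cardinalities."""
    total = 1
    for c in label_cardinalities:
        total *= c
    return total

def route(distinct_values: int) -> str:
    """High-cardinality dimensions belong on usage events,
    not on metric labels."""
    if distinct_values <= CARDINALITY_LIMIT:
        return "metric_label"
    return "event_dimension"

# user_id (10,000 users) crossed with request_type (12 values)
# on a single metric:
print(series_count([10_000, 12]))  # 120000 time series
```

The multiplication is the trap: each label looks cheap in isolation, but series count is the product across labels, not the sum.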
The Boring Operational Details
Where the two pipelines actually differ in day-to-day ops:
Retention. Tracing a few weeks, maybe. Billing store, years. Warehouse partitioning by date and account_id makes multi-year queries practical. Archive older partitions to object storage.
Access control. Traces: engineers. Billing events: accounting + support + an audit-only read path for legal. Not the same principals, not the same ACL model.
Schema governance. Traces: OTel semantic conventions, loose. Billing events: your own schema with a proto or Avro definition, version bumps tracked in a migration log, additive only.
Reconciliation. Billing needs to agree with itself. Daily reconciliation job that asserts “yesterday’s event count per user equals the sum of the per-hour counts” catches silent drops early. No equivalent makes sense for tracing.
Replay. When a billing bug is discovered, you need to replay historical events through a fixed pipeline. Kafka’s offset model makes this natural; NATS JetStream has it too. The tracing pipeline rarely needs replay — if the last two weeks of traces have a bug, you shrug and fix forward.
When You Can Get Away With One
Small workloads with no audit requirement, usage-based pricing below ~$1/user, and a team of three — one pipeline is fine. Add user attributes to spans, store them all, build a nightly aggregation job, call it billing. It works.
The threshold where it stops working is somewhere around:
- Revenue per customer exceeds the cost of a mistake. At $10k/month per customer, a dropped event is a $10k issue.
- The first auditor asks “show me exactly what this customer used in March 2024.” Unsampled, durable, retrievable, signed — those are table stakes for audit-grade billing, and sampled traces meet none of them.
- Engineering starts wanting cheaper traces. When the tracing pipeline outgrows your budget and someone proposes “let’s sample more aggressively,” you’re about to break billing.
When any of those lights up, separate the pipelines. The cheapest time to separate is before you’ve built tools on top of the unified one.
When to Invest in Splitting the Pipelines
Observability and cost attribution are adjacent problems that optimize for opposite things. A tracing pipeline that compromises on completeness becomes a bad billing pipeline. A billing pipeline that compromises on cardinality and retention becomes a bad tracing pipeline. Building one system that satisfies both usually produces two systems that satisfy neither.
The dual-path design isn’t more complex. It’s just honest about the constraints. Same emission library, same operation definition, two paths behind the emitter, each tuned for its job.
If you’re about to launch usage-based pricing and you’re planning to compute invoices from your trace store, rethink it now. The sooner you split, the cheaper the split.
Related
- NATS vs Kafka vs MQTT: Same Category, Very Different Jobs — why the durability choice on the billing path matters so much.
- RPC vs NATS: It’s Not About Sync vs Async — It’s About Who Owns Completion — completion ownership applies to the emit path, too.