Observability and Cost Attribution: Why One Pipeline Isn't Enough
Tracing systems optimize for signal-to-noise. Billing systems optimize for accuracy and auditability. They look similar, but they are not the same thing. A dual-path architecture for running both without cross-contamination.
A team I worked with tried to build their billing system on top of their tracing pipeline. The idea was clean: every operation already generates a span; spans already have duration and attributes; adding user_id and billable_units to each span lets finance query the trace store to compute invoices. One pipeline, less infrastructure. Beautiful.
Six weeks before the first billing cycle, the wheels came off. The tracing system was sampling at 10% because full capture was too expensive. The sampler was head-based, meaning whether a trace got kept was decided at request entry, long before the code knew whether the request was billable. Some users got charged for 10% of their actual usage; others got free service. Nobody’s invoice agreed with anybody’s usage report.
The workaround — “don’t sample billable traces” — sounded reasonable, broke the tracing pipeline’s cost model immediately, and created a dozen new edge cases around which requests counted as “billable.” Within a month the team was reluctantly building a second pipeline for billing. They still had the first one for traces. Now they had two pipelines that disagreed with each other.
The postmortem landed on a single sentence: observability and cost attribution aren’t the same problem, and pretending they are is expensive twice.
tl;dr — Tracing and metrics optimize for signal-to-noise — you want the interesting outliers, sampling is OK, dropping data is tolerable. Billing optimizes for completeness and auditability — every event must be captured and durably recorded, end of story. The two pipelines have opposite trade-offs on sampling, retention, schema evolution, and cost. Building them as one pipeline forces one of the two to lose. Build them as two, share primitives where possible, let each specialize where it must.
Why They Look Alike
Observability pipelines and billing pipelines do look eerily similar from a distance:
- Both capture events from production systems.
- Both attach metadata to those events.
- Both aggregate events over time windows.
- Both export to a query layer.
It’s tempting — especially to engineers who like clean architecture — to say these are the same problem and build one system. The similarity is surface-level. The constraints are opposite.
| Axis | Observability | Billing |
|---|---|---|
| Loss tolerance | High (sampling is fine) | Zero |
| Latency tolerance | Seconds to minutes | Minutes to hours |
| Retention | Days to weeks | Years |
| Schema evolution | Fast, frequent | Slow, with audit trail |
| Cardinality profile | Low cardinality on hot dims | Arbitrary (per user, per resource) |
| Consumers | SRE, engineering, on-call | Finance, legal, customer |
| Failure mode | Blind spot in a dashboard | Wrong invoice, legal exposure |
The one that really matters: loss tolerance. Everything else follows from it.
A tracing pipeline that drops 10% of spans is fine. You still see the outliers. You still find the slow paths. The system does its job.
A billing pipeline that drops 10% of events is a disaster. Some users underpay. Some users overpay. Finance reconciliation fails. You end up manually auditing transactions for weeks.
The moment one pipeline has to satisfy zero-loss and the other can tolerate 90% sampling, you have two different systems whether you wanted one or two.
The Dual-Path Architecture
The design I keep reaching back to is straightforward: two pipelines, shared ingest, separate durability and query paths.
```mermaid
flowchart LR
    App[Application code] -->|OTLP spans| TraceCol
    App -->|structured usage events, unsampled| UsageQ
    subgraph TracePath["Tracing path — loss-tolerant, fast"]
        TraceCol["Tracing collector"] --> Sampler["Sampler<br/>head or tail · ~10%"]
        Sampler --> HotStore["Hot trace store<br/>Tempo / Jaeger<br/>days retention"]
    end
    subgraph BillingPath["Billing path — zero-loss, auditable"]
        UsageQ["Durable queue<br/>Kafka / NATS JetStream<br/>WAL-durable"] --> Warehouse["Columnar warehouse<br/>BigQuery / Snowflake / ClickHouse<br/>years retention"]
    end
    classDef trace fill:#fef5e7,stroke:#b7791f
    classDef bill fill:#f0fff4,stroke:#2f855a,stroke-width:2px
    class TracePath trace
    class BillingPath bill
```
Two emission paths from the application. Two pipelines behind them. Each tuned for its job.
The tracing path
Stays conventional. OpenTelemetry SDK emits spans. Collector applies head-based or tail-based sampling. Hot store (Tempo, Jaeger, Grafana Cloud) gets 10-20% of the volume. Retention a few days to a few weeks. Query layer is for engineers debugging incidents.
What I optimize for here:
- Cost per span — you’re keeping billions; every byte matters.
- Query latency — on-call wants answers in seconds.
- Auto-instrumentation coverage — the fewer things you have to manually instrument, the better.
What I don’t care about:
- Full capture. Sampling is fine.
- Long retention. You’re debugging last Tuesday, not last fiscal year.
- Per-user accuracy. If a single user’s trace got dropped, nobody cares.
The usage-event path
The dedicated billing pipeline. Every billable operation emits a usage event — a small, structured record with everything finance needs and nothing it doesn’t.
```json
{
  "event_id": "ue_01HFNGR...",
  "occurred_at": "2026-02-14T18:22:30.145Z",
  "account_id": "acc_12345",
  "resource_id": "res_6789",
  "operation": "api.request",
  "dimensions": {
    "region": "us-east-1",
    "tier": "standard"
  },
  "units": {
    "requests": 1,
    "cpu_ms": 147,
    "egress_bytes": 8342
  },
  "idempotency_key": "req_abc_20260214182230"
}
```
The rules on this path:
- Unsampled. Every billable operation emits exactly one event. No head sampling. No tail sampling. No “approximate.”
- Durable writes. Emitter has a local write-ahead log or durable queue. If the downstream is down, events buffer locally until delivery. No dropped events under partial failure.
- Idempotency keys. Every event has a unique ID (or composite key) so downstream dedup is trivial. This lets you retry safely.
- Schema versioned and immutable. Once an event shape is shipped, it doesn’t mutate. New fields add a new version. Old versions keep working until you intentionally deprecate.
- Long retention. Years, usually. Auditors ask for 2023’s data in 2027.
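The versioning rule lends itself to mechanical enforcement. A minimal sketch of additive-only schema validation — the `schema_version` discriminator and the per-version field sets are illustrative assumptions, not part of the event shape above:

```python
# Each shipped event shape is frozen; evolution means a new version
# whose required fields are a superset of the old version's.
USAGE_EVENT_SCHEMAS = {
    1: {"event_id", "occurred_at", "account_id", "operation", "units"},
    # v2 adds fields; v1 events keep validating against v1 forever.
    2: {"event_id", "occurred_at", "account_id", "operation", "units",
        "resource_id", "dimensions"},
}

def validate(event: dict) -> bool:
    """An event must carry every field of the version it declares."""
    required = USAGE_EVENT_SCHEMAS[event["schema_version"]]
    return required <= event.keys()
```

Old emitters keep shipping v1 events untouched; the deprecation of a version is an explicit decision, not a side effect of a field rename.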
The downstream infrastructure matches: Kafka or NATS JetStream with high replication factor for ingest, columnar warehouse (BigQuery, Snowflake, ClickHouse) for aggregation and query, separate auth and access control from engineering-facing tools.
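A sketch of what the emitter side can look like under those rules. `UsageEmitter` and its in-memory buffer are illustrative stand-ins — a real implementation would use an on-disk WAL and a Kafka or JetStream producer as the transport:

```python
import json
import time
import uuid
from collections import deque

class UsageEmitter:
    """Sketch of a zero-loss usage-event emitter.

    Events are appended to a local buffer (standing in for an on-disk
    write-ahead log) before any network send, so a downstream outage
    never drops an event -- it only delays delivery.
    """

    def __init__(self, transport):
        self.transport = transport   # e.g. a durable-queue producer (assumed)
        self.wal = deque()           # stand-in for the on-disk WAL

    def emit(self, account_id: str, operation: str, units: dict):
        event = {
            "event_id": f"ue_{uuid.uuid4().hex}",  # idempotency key
            "occurred_at": time.time(),
            "account_id": account_id,
            "operation": operation,
            "units": units,
        }
        self.wal.append(event)   # buffer first...
        self.flush()             # ...then best-effort delivery

    def flush(self):
        while self.wal:
            event = self.wal[0]
            try:
                self.transport(json.dumps(event))
            except ConnectionError:
                return           # keep buffering; retry on the next flush
            self.wal.popleft()   # remove only after confirmed delivery
```

The ordering is the point: durability happens before the network call, and removal from the buffer happens only after the call succeeds, so a retry can at worst produce a duplicate — which the idempotency key makes harmless downstream.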
What the two paths share
Not nothing. They share:
- The trace/request ID. Usage events include the trace ID of the request that generated them. This is the one cross-pipeline link that matters — when finance escalates “this user says they were charged for X requests but they swear they only made Y,” you want to be able to find the traces of those Y requests.
- OpenTelemetry as the emission library. OTel can emit both spans and custom events. Using it for both keeps the instrumentation codepaths uniform. But the pipelines behind the emitter are different.
- The application’s definition of an “operation.” Both pipelines have opinions about what counts as one operation. Keep that definition single-source.
Why Head-Sampling Kills Billing
Worth dwelling on the specific thing that breaks when you try to unify.
Head-based sampling decides whether to record a trace at entry, based on trace ID. It’s O(1), stateless, and fair across traffic shapes — the standard default.
The failure: at entry time, the system has no idea whether this request will be billable. The sampler doesn’t know if the user is on a paid plan, if the request will succeed, if it will hit a billable feature. It just picks randomly.
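To make that concrete, here is a head-based sampler in miniature — a sketch, not any particular SDK's implementation. The keep/drop decision is a pure function of the trace ID, so nothing about the user's plan or the request's outcome can influence it:

```python
import hashlib

SAMPLE_RATE = 0.10  # keep ~10% of traces

def head_sample(trace_id: str) -> bool:
    """Decide at request entry whether to record this trace.

    The only input is the trace ID -- the sampler cannot see whether
    the request is billable, because that isn't known yet.
    """
    # Hash the trace ID into [0, 1) and compare against the rate.
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < SAMPLE_RATE

# Roughly 10% of traces survive -- billable and non-billable alike.
kept = sum(head_sample(f"trace-{i}") for i in range(100_000))
```

Hashing rather than `random()` makes the decision deterministic per trace ID, which is what lets every service in a distributed call chain agree on the same keep/drop verdict — and it is also exactly why the verdict can never depend on billability.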
Tail-based sampling fixes part of this — you decide after the fact, based on span attributes. Now you can keep all errors, all slow requests, all requests from paid users. Better, but still subject to buffering limits. Heavy tail-based samplers sit in front of your trace ingest pipeline and drop spans when buffers fill, which still gives you lossy billing during traffic bursts.
The only sampler that’s correct for billing is “capture everything.” And “capture everything” is what the tracing pipeline tries to avoid, because that’s what makes it expensive.
You can do “capture everything for billable operations, sample everything else” in one pipeline. It works. It also ends up being the most complex sampler you’ve ever written, with an exception branch that duplicates the decision logic from your actual billing code. The dedicated usage-event path is simpler.
Cardinality and the Per-User Problem
A related anti-pattern: attaching user ID as a Prometheus label.
Prometheus (and most metrics systems) stores one time series per label combination. Add a user_id label to a metric that ten thousand users hit, and you just created ten thousand time series. Add a request_type label alongside, and that’s ten thousand × request-type-count. Cardinality explodes. Your metrics storage bill goes with it.
The instinct is fine — “I want to track per-user throughput” — but the mechanism is wrong. A high-cardinality label on a metric is a square peg in a round hole; a usage event with account_id as a dimension is the fit that works. Emit the event, then aggregate per-user in the warehouse at query time.
Rule I use: metrics for engineering-facing dashboards, events for business-facing attribution. If the label cardinality could exceed ~1,000 distinct values, it belongs in an event, not a label.
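The arithmetic behind that rule is worth making explicit — a short sketch, with the numbers mirroring the example above:

```python
CARDINALITY_LIMIT = 1_000  # rule of thumb from the text

def series_count(label_cardinalities: list[int]) -> int:
    """Time series created by one metric = the product of its
    label cardinalities."""
    total = 1
    for c in label_cardinalities:
        total *= c
    return total

def route(distinct_values: int) -> str:
    """High-cardinality dimensions belong on usage events,
    not on metric labels."""
    if distinct_values <= CARDINALITY_LIMIT:
        return "metric_label"
    return "event_dimension"

# user_id (10,000 users) crossed with request_type (12 values)
# on a single metric:
print(series_count([10_000, 12]))  # 120000 time series
```

The multiplication is the trap: each label looks cheap in isolation, but series count is the product across labels, not the sum.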
The Boring Operational Details
Where the two pipelines actually differ in day-to-day ops:
Retention. Tracing a few weeks, maybe. Billing store, years. Warehouse partitioning by date and account_id makes multi-year queries practical. Archive older partitions to object storage.
Access control. Traces: engineers. Billing events: accounting + support + an audit-only read path for legal. Not the same principals, not the same ACL model.
Schema governance. Traces: OTel semantic conventions, loose. Billing events: your own schema with a proto or Avro definition, version bumps tracked in a migration log, additive only.
Reconciliation. Billing needs to agree with itself. Daily reconciliation job that asserts “yesterday’s event count per user equals the sum of the per-hour counts” catches silent drops early. No equivalent makes sense for tracing.
Replay. When a billing bug is discovered, you need to replay historical events through a fixed pipeline. Kafka’s offset model makes this natural; NATS JetStream has it too. The tracing pipeline rarely needs replay — if the last two weeks of traces have a bug, you shrug and fix forward.
When You Can Get Away With One
Small workloads with no audit requirement, usage-based pricing below ~$1/user, and a team of three — one pipeline is fine. Add user attributes to spans, store them all, build a nightly aggregation job, call it billing. It works.
The threshold where it stops working is somewhere around:
- Revenue per customer exceeds the cost of a mistake. At $10k/month per customer, a dropped event is a $10k issue.
- The first auditor asks “show me exactly what this customer used in March 2024.” Unsampled, durable, retrievable, signed — those are table stakes for audit-grade billing, and sampled traces meet none of them.
- Engineering starts wanting cheaper traces. When the tracing pipeline outgrows your budget and someone proposes “let’s sample more aggressively,” you’re about to break billing.
When any of those lights up, separate the pipelines. The cheapest time to separate is before you’ve built tools on top of the unified one.
When to Invest in Splitting the Pipelines
Observability and cost attribution are adjacent problems that optimize for opposite things. A tracing pipeline that compromises on completeness becomes a bad billing pipeline. A billing pipeline that compromises on cardinality and retention becomes a bad tracing pipeline. Building one system that satisfies both usually produces two systems that satisfy neither.
The dual-path design isn’t more complex. It’s just honest about the constraints. Same emission library, same operation definition, two paths behind the emitter, each tuned for its job.
If you’re about to launch usage-based pricing and you’re planning to compute invoices from your trace store, rethink it now. The sooner you split, the cheaper the split.
Related
- NATS vs Kafka vs MQTT: Same Category, Very Different Jobs — why the durability choice on the billing path matters so much.
- RPC vs NATS: It’s Not About Sync vs Async — It’s About Who Owns Completion — completion ownership applies to the emit path, too.