Cache Miss, TLB Miss, False Sharing — The Three Invisible Performance Killers
Field notes on the three CPU-level effects that make backend code slower than your profile says — cache misses, TLB misses, and false sharing — demonstrated with real C code and perf measurements via CoreTracer. The same effects compound under multi-tenant AI inference workloads where you don't control which cores you land on.
Status: Companion blog draft. Anchors on Obsidian notes [[cacheline]] + [[false cache & pingpong]] + [[cpu]] — real perf data exists (4,369ms → 8,118ms ping-pong test).
Companion assets
- Original video: Cache Miss, TLB Miss & False Sharing: The Ultimate Performance Killers in 3 Minutes!
- GitHub: harrison001/CoreTracer — high-performance kernel & assembly debugging toolkit including cacheline, NUMA, lock-free experiments
TL;DR
Three classes of “invisible” performance killers, in order of how often they explain unexpected slowness in production:
- Cache miss — the data you need isn’t in L1/L2/L3; the CPU stalls for hundreds of cycles fetching it from DRAM
- TLB miss — the page table entry for your address isn’t cached in the TLB; the CPU must walk the page tables, often costing even more cycles than a cache miss
- False sharing — two threads write to different variables that happen to share a cache line; each write invalidates the other core’s copy, turning concurrent code into a serial ping-pong
Real measurement from the lab: same workload, with vs without false sharing, 4,369 ms → 8,118 ms (1.86× slowdown). Profile-level metrics will not tell you why.
The setup
- Two threads doing independent counter increments on adjacent fields of a struct
- Run once with fields packed (false sharing) and once with cache-line padding
- Measure with `perf stat -e cache-misses,dTLB-load-misses,L1-dcache-load-misses ./bench`
- Compare cycles, IPC, cache miss counts
Debug command transcript
# TODO: paste exact perf invocation + CoreTracer benchmark from the video
# perf stat -e cycles,instructions,cache-misses,L1-dcache-load-misses,dTLB-load-misses ./coretracer_falseshare
# perf stat -e cycles,instructions,cache-misses,L1-dcache-load-misses,dTLB-load-misses ./coretracer_padded
What the data shows
| Workload | Wall time | L1 misses | dTLB misses | False sharing? |
|---|---|---|---|---|
| Packed counters | 8,118 ms | high | low | Yes |
| Padded counters | 4,369 ms | low | low | No |
The two threads are doing the same work. The struct layout — specifically, whether two hot fields share a cache line — is the entire difference.
What this teaches backend / AI infra engineers
These three effects are why “I added more cores and it didn’t get faster” happens. The classic versions:
- Cache miss on shared lookup tables: a hash map probed by many goroutines, where each lookup pulls a different bucket and the bucket isn’t in L1. Common in routing layers, feature stores, inference cache lookups.
- TLB miss in container hosts with high page-table pressure: many small containers + many small mappings = TLB churn. Inference servers handling many tenants hit this.
- False sharing in performance counters / metrics: every “metric increment” that lands in a shared cache line with sibling metrics turns into a sequential bottleneck.
The lesson generalizes: memory layout is performance, not “an optimization.” A struct field reorder can be a 2× speedup. A 64-byte padding decision can keep multi-threaded code scaling linearly. For AI infra teams building inference servers, the same effects compound under multi-tenant load where you don’t control which cores you land on.
The CoreTracer repo has reproducible benchmarks for each of these — clone it, run perf, see the numbers move.
Related work
- Video: Store→Load Reordering x86 vs ARM64
- GitHub: CoreTracer
- Companion blog (TBD): “Cache line contention in production — from microbenchmark to actual incident”
I occasionally advise small teams on backend reliability, Go performance, and production AI systems. Learn more: /services