Cache Miss, TLB Miss, False Sharing — The Three Invisible Performance Killers

Field notes on the three CPU-level effects that make backend code slower than your profile says — cache misses, TLB misses, and false sharing — demonstrated with real C code and perf measurements via CoreTracer. The same effects compound under multi-tenant AI inference workloads where you don't control which cores you land on.

May 12, 2026
Harrison Guo
3 min read
Kernel Debug Field Notes · Performance Analysis

Status: Companion blog draft. Anchors on Obsidian notes [[cacheline]] + [[false cache & pingpong]] + [[cpu]] — real perf data exists (4,369ms → 8,118ms ping-pong test).

Companion assets

  • Original video:

    Cache Miss, TLB Miss & False Sharing: The Ultimate Performance Killers in 3 Minutes!

  • GitHub: harrison001/CoreTracer — high-performance kernel & assembly debugging toolkit including cacheline, NUMA, lock-free experiments

TL;DR

Three classes of “invisible” performance killers, in order of how often they explain unexpected slowness in production:

  1. Cache miss — the data you need isn’t in L1/L2/L3; the CPU stalls for hundreds of cycles waiting on DRAM
  2. TLB miss — the page table entry for your virtual address isn’t cached in the TLB; the CPU must walk the page tables, which can cost even more than a data cache miss
  3. False sharing — two threads write to different variables that happen to share a cache line; each write invalidates the other core’s copy, so “concurrent” code serializes on cache-line ping-pong

Real measurement from the lab: same workload, with vs without false sharing, 4,369 ms → 8,118 ms (1.86× slowdown). Profile-level metrics will not tell you why.

The setup

  • Two threads doing independent counter increments on adjacent fields of a struct
  • Run once with fields packed (false sharing) and once with cache-line padding
  • perf stat -e cache-misses,dTLB-load-misses,L1-dcache-load-misses ./bench
  • Compare cycles, IPC, cache miss counts

Debug command transcript

# TODO: paste exact perf invocation + CoreTracer benchmark from the video
# perf stat -e cycles,instructions,cache-misses,L1-dcache-load-misses,dTLB-load-misses ./coretracer_falseshare
# perf stat -e cycles,instructions,cache-misses,L1-dcache-load-misses,dTLB-load-misses ./coretracer_padded

What the data shows

Workload          Wall time   L1 misses   dTLB misses   False sharing?
Packed counters   8,118 ms    high        low           Yes
Padded counters   4,369 ms    low         low           No

The two threads are doing the same work. The struct layout — specifically, whether two hot fields share a cache line — is the entire difference.

What this teaches backend / AI infra engineers

These three effects are why “I added more cores and it didn’t get faster” happens. The classic versions:

  • Cache miss on shared lookup tables: a hash map probed by many goroutines, where each lookup pulls a different bucket and the bucket isn’t in L1. Common in routing layers, feature stores, inference cache lookups.
  • TLB miss in container hosts with high page-table pressure: many small containers + many small mappings = TLB churn. Inference servers handling many tenants hit this.
  • False sharing in performance counters / metrics: every “metric increment” that lands in a shared cache line with sibling metrics turns into a sequential bottleneck.

The lesson generalizes: memory layout is performance, not “an optimization.” A struct field reorder can be a 2× speedup. A 64-byte padding decision can be the difference between multi-threaded code that scales linearly and code that doesn’t. For AI infra engineers building inference servers, the same effects compound under multi-tenant load where you don’t control which cores you land on.

The CoreTracer repo has reproducible benchmarks for each of these — clone it, run perf, see the numbers move.


