Cache Miss, TLB Miss, False Sharing — The Three Invisible Performance Killers

Field notes on the three CPU-level effects that make backend code slower than your profile says — cache misses, TLB misses, and false sharing — demonstrated with real C code and perf measurements via CoreTracer. The same effects compound under multi-tenant AI inference workloads where you don't control which cores you land on.

May 12, 2026
Harrison Guo
3 min read
Kernel Debug Field Notes · Performance Analysis

Status: Companion blog draft. Anchors on Obsidian notes [[cacheline]] + [[false cache & pingpong]] + [[cpu]] — real perf data exists (4,369ms → 8,118ms ping-pong test).

Companion assets

  • Original video:

    Cache Miss, TLB Miss & False Sharing: The Ultimate Performance Killers in 3 Minutes!

  • GitHub: harrison001/CoreTracer — high-performance kernel & assembly debugging toolkit including cacheline, NUMA, lock-free experiments

TL;DR

Three classes of “invisible” performance killers, in order of how often they explain unexpected slowness in production:

  1. Cache miss — the data you need isn’t in L1/L2/L3; the CPU stalls for hundreds of cycles waiting on DRAM
  2. TLB miss — the page table entry for your virtual address isn’t cached in the TLB; the CPU must walk the page tables, which can cost even more than a data cache miss
  3. False sharing — two threads write to different variables that happen to share a cache line; each write invalidates the other core’s copy, so “concurrent” code serializes on cache-line ping-pong

Real measurement from the lab: same workload, with vs without false sharing, 4,369 ms → 8,118 ms (1.86× slowdown). Profile-level metrics will not tell you why.

The setup

  • Two threads doing independent counter increments on adjacent fields of a struct
  • Run once with fields packed (false sharing) and once with cache-line padding
  • perf stat -e cache-misses,dTLB-load-misses,L1-dcache-load-misses ./bench
  • Compare cycles, IPC, cache miss counts

Debug command transcript

# TODO: paste exact perf invocation + CoreTracer benchmark from the video
# perf stat -e cycles,instructions,cache-misses,L1-dcache-load-misses,dTLB-load-misses ./coretracer_falseshare
# perf stat -e cycles,instructions,cache-misses,L1-dcache-load-misses,dTLB-load-misses ./coretracer_padded

What the data shows

Workload          Wall time   L1 misses   dTLB misses   False sharing?
Packed counters   8,118 ms    high        low           Yes
Padded counters   4,369 ms    low         low           No

The two threads are doing the same work. The struct layout — specifically, whether two hot fields share a cache line — is the entire difference.

What this teaches backend / AI infra engineers

These three effects are why “I added more cores and it didn’t get faster” happens. The classic versions:

  • Cache miss on shared lookup tables: a hash map probed by many goroutines, where each lookup pulls a different bucket and the bucket isn’t in L1. Common in routing layers, feature stores, inference cache lookups.
  • TLB miss in container hosts with high page-table pressure: many small containers + many small mappings = TLB churn. Inference servers handling many tenants hit this.
  • False sharing in performance counters / metrics: every “metric increment” that lands in a shared cache line with sibling metrics turns into a sequential bottleneck.

The lesson generalizes: memory layout is performance, not “an optimization.” A struct field reorder can be a 2× speedup. A 64-byte padding decision can be the difference between multi-threaded code that scales linearly and code that doesn’t. For AI infra engineers building inference servers, the same effects compound under multi-tenant load where you don’t control which cores you land on.

The CoreTracer repo has reproducible benchmarks for each of these — clone it, run perf, see the numbers move.


