Go Profiling in Anger: pprof, Escape Analysis, and Inlining Without Magic
Most performance advice for Go is ritual — 'use sync.Pool,' 'avoid interface boxing,' 'preallocate slices.' Useful sometimes, hollow often. A production engineer's guide to profiling Go systems with pprof, reading escape analysis output, and understanding when the compiler actually inlines.
Go’s performance culture has a ritual quality. “Use sync.Pool.” “Avoid interface boxing.” “Preallocate slices.” Copy-pasted from blog posts and applied without measurement. Sometimes helpful. Often hollow.
The honest answer is that Go performance work is mostly just profiling. Good profiling tells you what’s actually slow. Bad profiling — or no profiling — leaves you guessing. The toolchain that Go ships with is genuinely excellent; more engineers should use it, and fewer should follow checklist optimizations they haven’t measured.
This is a practical, end-to-end guide to pprof, escape analysis, and inlining — the three Go-specific tools that answer most performance questions.
tl;dr — Start every Go perf investigation with a CPU pprof of the hot path under realistic load. 80% of issues are obvious in the flame graph. For the remaining 20%, add a heap profile and look for allocation pressure driving GC. Only after you’ve localized the problem with real data should you reach for micro-optimizations: escape analysis via -gcflags='-m', inlining hints, and targeted benchmark-driven rewrites. Skip the profile step, and you are optimizing the wrong thing.
The Investigation Flow
```mermaid
flowchart TD
    Start([Performance concern]) --> CPU["Take CPU profile<br/>-http pprof · 30s under load"]
    CPU --> Hot{"Hot code obvious?"}
    Hot -->|Yes| Fix1["Fix the hot path · re-measure"]
    Hot -->|No · GC high| Heap["Take heap / alloc profile"]
    Heap --> AllocSite{"Specific alloc site?"}
    AllocSite -->|Yes| Escape["Check -gcflags='-m' for that function"]
    AllocSite -->|No| BenchMicro["Isolate in benchmark<br/>-benchmem · -count=5"]
    Escape --> Fix2["Fix alloc · re-measure"]
    BenchMicro --> Fix3["Optimize or accept"]
    Fix1 --> Verify["Profile again · confirm"]
    Fix2 --> Verify
    Fix3 --> Verify

    classDef start fill:#e8f4f8,stroke:#2c5282
    classDef action fill:#f0fff4,stroke:#2f855a
    classDef verify fill:#fef5e7,stroke:#b7791f
    class Start start
    class Fix1,Fix2,Fix3 action
    class Verify verify
```
CPU Profiling: The First Thing, Always
Every Go binary can expose a pprof HTTP endpoint in two lines:
```go
import _ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux

// later, in main
go http.ListenAndServe("localhost:6060", nil)
```
Under load, grab a CPU profile:
```shell
$ go tool pprof -http=:9999 'http://localhost:6060/debug/pprof/profile?seconds=30'
```
This opens a flame graph in your browser. The wide blocks are where CPU time is spent. Usually the answer is immediate — “oh, JSON encoding is 40% of my CPU; let me switch to a faster encoder.” Or “regex compilation is in the hot path because someone forgot to pre-compile.”
A few things that look surprising on first profile but shouldn’t:
- runtime.mallocgc taking 10%+ is GC pressure. You’re allocating a lot; look at the heap profile next.
- runtime.schedule or runtime.findrunnable taking 5%+ means you have too many goroutines churning. Check whether you’re spawning per-request.
- syscall.Syscall high means you’re system-call-heavy — usually I/O. Either buffer/batch, or consider epoll-direct if it’s in your hot path.
- sync.(*Mutex).Lock visible means contention. Either shrink the lock hold time or shard the lock.
Don’t guess your way through these. Click into each, read the stack, find the user code that caused it.
Heap Profiling: When CPU Points to GC
If runtime.mallocgc shows up in your CPU profile as a non-trivial chunk, heap profile tells you why:
```shell
$ go tool pprof -http=:9999 http://localhost:6060/debug/pprof/heap
$ go tool pprof -http=:9999 http://localhost:6060/debug/pprof/allocs
```
heap shows memory currently in use (inuse_space). allocs shows cumulative allocations since program start (alloc_space) — this is usually what you want to optimize.
In the flame graph, look for:
- Specific allocation sites taking a disproportionate share. A single line of code creating 50% of allocations is an obvious target.
- Calls to makeslice, makemap, newobject with known-size inputs. If you know the size, preallocate.
- Interface boxing in hot paths. Every time you pass a concrete type through an interface{} argument in a tight loop, the runtime may heap-allocate the boxed value.
- String concatenation with +. This is the textbook preventable allocation — use strings.Builder.
The goal isn’t “zero allocations” — that’s usually not practical. The goal is “allocations per operation in a tight, repeated path are bounded and understood.”
Escape Analysis: The Compiler’s Story
Go’s compiler decides at compile time whether a variable lives on the stack (free, reclaimed automatically when the function returns) or the heap (allocated, tracked by the garbage collector). This is called escape analysis.
To see the analysis for your code:
```shell
$ go build -gcflags='-m' ./...
```
Output looks like:
```
./foo.go:12:6: can inline hotFunction
./foo.go:15:10: &Thing{} escapes to heap
./foo.go:18:14: make([]int, 100) does not escape
./foo.go:22:6: leaking param: x
```
Key things to read for:
- escapes to heap — this allocation is heap-allocated. If it’s in a hot path, investigate.
- does not escape — stack-allocated, free. You want most short-lived locals to do this.
- leaking param — the caller’s passed value escapes because this function keeps a reference to it. Often fixable by taking a copy or not storing a reference.
The most common surprise: passing a value to a function that eventually hands it to interface{} causes the value to escape. A pattern like:
```go
func log(msg string, args ...interface{}) { /* ... */ }

func handleRequest(req *Request) {
	log("got request", req.ID) // req.ID boxes to interface{} and may escape
}
```
req.ID escapes because of the ...interface{} argument. In a tight path, this is measurable. Fix: use a typed logger that takes concrete types, or accept the cost because logging on the hot path is usually not the hot path.
Escape analysis is one of those things where reading the output a few times is worth it. You start seeing your code differently.
Inlining: When the Compiler Eliminates the Call
Go’s compiler inlines small functions to avoid call overhead. Seeing what got inlined:
```shell
$ go build -gcflags='-m' ./... 2>&1 | grep -E 'can inline|cannot inline'
./foo.go:12:6: can inline hotFunction
./foo.go:18:6: cannot inline bigFunction: function too complex: cost 117 exceeds budget 80
./foo.go:22:6: cannot inline interfacingFunction: call to unknown method
```
The default inlining budget is 80 AST nodes. Hard blockers:
- Calls through interfaces. The compiler doesn’t know what concrete method gets called, so it can’t inline across the call.
- Functions containing a for range over a channel (and, historically, loops in general), though the mid-stack inliner has improved this.
- Recursive functions. Obvious.
- Functions over the budget. Refactor smaller if the call is hot.
When to care:
- Never in normal code. Go inlines what it can; your code runs.
- Sometimes in tight hot loops where the call overhead is 10%+ of the total work. Benchmark shows it.
- Occasionally when you control an interface boundary and can replace it with a concrete type on a hot path.
Don’t structure your code around inlining. Code readability beats hypothetical call-overhead wins in nearly every case.
Benchmarks: The Ground Truth
Every perf claim should be backed by a benchmark. testing.B is the tool:
```go
func BenchmarkEncodeResponse(b *testing.B) {
	resp := newResponse()
	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		_, _ = encode(resp)
	}
}
```
Run:
```shell
$ go test -bench=BenchmarkEncode -benchmem -count=5
```
-count=5 runs each bench 5 times, so you can compare variance. Don’t trust a single run. Hardware, OS scheduling, thermals — all add noise.
For comparing two implementations:
```shell
$ go test -bench=BenchmarkEncodeResponse -benchmem -count=10 ./... > old.txt
# (change code)
$ go test -bench=BenchmarkEncodeResponse -benchmem -count=10 ./... > new.txt
$ benchstat old.txt new.txt
```
benchstat (golang.org/x/perf/cmd/benchstat) gives you statistical significance. If the difference isn’t statistically meaningful, you didn’t actually improve anything — you just rolled the dice differently.
The 80/20 of Go Performance
After enough of this work, a few patterns dominate the real wins:
- Query shape, not language. A slow endpoint is usually doing 10 DB queries when it could do 1. Go is almost never the bottleneck; the data layer is.
- Network hop count. Every inter-service call adds latency. Merging two small services or co-locating tight integrations beats any language-level optimization.
- Caching at the right layer. A well-placed LRU cache saves more than micro-optimizing the uncached path.
- Preallocating known-size slices/maps.
make([]int, 0, n)when you know n is almost free. The defaultmake([]int, 0)reallocates as you append. - Avoiding interface boxing in loops. This is the one micro-optimization that regularly shows up in real profiles.
Everything else — sync.Pool, escape analysis hand-tuning, loop unrolling — is a long-tail optimization. Worth it when profiling tells you it is. Premature otherwise.
A Habit I Recommend
Before adding any optimization, do exactly three things:
1. Take a profile with the optimization off. Save it.
2. Apply the optimization.
3. Take a profile with the optimization on. Compare.
If the comparison doesn’t show clear improvement on the metric you cared about, revert. Do not add complexity without evidence.
This sounds obvious. Almost nobody does it. Most perf work in Go codebases accumulates dead optimizations that add nothing or actively hurt — but nobody knows which, because nobody benchmarked.
The Habit That Compounds
Go’s performance tooling is better than Go’s performance culture gives it credit for. pprof, escape analysis, inlining diagnostics, and benchmarks are built in. They’re precise. They tell you the truth.
The reason most Go code isn’t as fast as it could be isn’t that Go is slow (it isn’t). It’s that engineers copy-paste optimizations they haven’t measured, call the work done, and move on. The few engineers who profile first and optimize second write code that’s actually fast — and usually simpler than the ritual-heavy version.
Profile first. Everything else follows.
Related
- sync.Pool in Go: When It Actually Helps, and When It Quietly Hurts — the one Go optimization most likely to be misapplied.
- Why Go Handles Millions of Connections: User-Space Context Switching, Explained — understanding the runtime is the prerequisite to understanding profiles.
- Testing Real-World Go Backends Isn’t What Many People Think — benchmarking is the last mile of testing.