Building & Debugging a Custom ARM64 Linux Kernel — Yocto, QEMU, GDB
Field notes from building an ARM64 Linux kernel with Yocto, booting it in QEMU, and walking through it with GDB. The point isn't the recipe — it's what the workflow reveals about the kernel/user boundary that production backend engineers usually never see.
Status: Companion blog draft for the existing video. Long-form transcript + bridge framing TBD.
Companion assets
- Original video: How to Build & Debug a Custom ARM64 Linux Kernel with Yocto, QEMU, and GDB
- GitHub: harrison001/CoreTracer — companion kernel-debug toolkit (CPU affinity, NUMA, lock-free, cacheline experiments)
TL;DR
You can rebuild and debug an ARM64 Linux kernel end-to-end on a laptop without dedicated hardware. The workflow matters less than what it surfaces: scheduler decisions, context-switch cost, and kernel/user boundary behavior — the same forces that show up as p99 latency in production Go services and as unpredictable token-streaming latency in AI inference pipelines.
The setup
- Yocto build → custom ARM64 kernel image with debug symbols
- QEMU running the image
- GDB attached over the QEMU stub port
- Workflow: (gdb) target remote :1234 → set kernel breakpoints → step through schedule()
Debug command transcript
# TODO: paste the actual yocto recipe + qemu-system-aarch64 invocation + gdb attach sequence from the video
# bitbake harrison-image
# runqemu qemuarm64 nographic kvm slirp
# aarch64-linux-gnu-gdb vmlinux
# (gdb) target remote :1234
# (gdb) break schedule
# (gdb) c
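Until the exact transcript from the video is pasted in, here's a rough sketch of the shape of the sequence, using stock Yocto names rather than the ones from the recording (the image name and vmlinux path below are placeholders). It also assumes KASLR is already out of the way; see the next section. The detail that trips people up is that QEMU has to start with the GDB stub open and the CPU halted, which runqemu can do via qemuparams:

# Build the image (placeholder image name; substitute the custom recipe's)
bitbake core-image-minimal

# Boot with the GDB stub listening and the CPU frozen until a debugger attaches
#   -s → gdbserver on tcp::1234
#   -S → halt the CPU at startup
runqemu qemuarm64 nographic slirp qemuparams="-s -S"

# In a second terminal, point a cross-GDB at the unstripped vmlinux from the same build
aarch64-linux-gnu-gdb path/to/vmlinux
(gdb) target remote :1234
(gdb) hbreak schedule        # hardware breakpoint; more dependable than a software break this early in boot
(gdb) continue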
What broke (or rather, what was opaque before)
ARM64 kernel debug on a developer laptop is usually a black box because:
- Cross-toolchain mismatches make symbols misalign with the running kernel
- Without KASLR disabled, GDB and the kernel disagree about where functions live (there's a quick check for this sketched after the list)
- Yocto’s default kernel config strips out debug info needed for inline-function unwinding
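A quick way to confirm the KASLR/symbol mismatch before touching any config: compare where the vmlinux debug info says a function lives against where the running kernel actually placed it. These are generic commands, not from the video, and the guest side needs root:

# Host side: address of schedule() according to the vmlinux GDB will load
nm vmlinux | grep ' T schedule$'

# Guest side: the address the running kernel is really using
echo 0 > /proc/sys/kernel/kptr_restrict    # otherwise kallsyms prints zeroed addresses
grep ' T schedule$' /proc/kallsyms

# If the two addresses differ, KASLR (or a stale vmlinux) has shifted the kernel,
# and breakpoints set from vmlinux addresses will land in the wrong place.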
What fixed it
TODO: write up the specific config changes (CONFIG_DEBUG_INFO, nokaslr boot param, matching toolchain version, etc.) and the GDB workflow that made stepping through schedule() actually informative.
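Pending that write-up, here's a sketch of the usual ingredients. The fragment filename is made up for illustration, and the exact changes in the video may differ:

# Kernel config changes (in Yocto these usually go in a .cfg fragment added via
# SRC_URI in a linux-yocto bbappend; debug.cfg is a hypothetical name)
CONFIG_DEBUG_INFO=y
CONFIG_GDB_SCRIPTS=y
# CONFIG_DEBUG_INFO_REDUCED is not set
# CONFIG_RANDOMIZE_BASE is not set

# Alternative to disabling CONFIG_RANDOMIZE_BASE: leave KASLR built in and turn it off per boot
runqemu qemuarm64 nographic slirp bootparams="nokaslr" qemuparams="-s -S"

# Either way, the vmlinux handed to GDB must come from the exact same build as the
# Image QEMU boots, and the cross-GDB must match the target architecture, or the
# symbols never line up.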
What this teaches backend / AI infra engineers
Most production engineers never look below the syscall boundary. They see Go goroutine yields, container CPU throttling, and unexplained latency spikes — and reach for pprof, then for “the cluster is loaded.” But the actual mechanism is in the kernel scheduler: which CPU your thread runs on, how often it migrates, when the kernel preempts vs. yields.
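You don't need the full kernel-debug setup to start seeing those forces; a few stock Linux commands expose them per process (the PID below is a placeholder):

# Voluntary vs. involuntary context switches accumulated by a process
grep ctxt /proc/1234/status

# Context switches and cross-CPU migrations over a ten-second window
perf stat -e context-switches,cpu-migrations -p 1234 -- sleep 10

# Which CPU each thread is sitting on right now (PSR column)
ps -L -o pid,tid,psr,comm -p 1234

These tools tell you that the numbers move; GDB on the kernel is what tells you why.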
For AI infrastructure specifically: every LLM inference call makes the userspace-to-kernel-to-userspace round trip many times over (file I/O, network, the GPU driver). The latency variability you blame on the model is often kernel-side scheduling, not the model.
Knowing how to attach a debugger to the kernel and watch schedule() fire isn’t trivia — it’s the only way to make confident claims about where time is actually spent.
Related work
- Video: Debugging Ubuntu 6.8 x86_64 Kernel with GDB & QEMU — sibling debug workflow on x86
- GitHub: CoreTracer — reproducible kernel-debug experiments
I occasionally advise small teams on backend reliability, Go performance, and production AI systems. Learn more: /services