Store→Load Reordering Explained — x86 vs ARM64, Real-World Test

Field notes on what happens when CPUs reorder your memory accesses. x86 implements TSO (Total Store Order), or something very close to it. ARM64 is weakly ordered. Run the same lock-free code on both and the bug appears only on ARM. This is not academic: it is why some Go programs ship to production fine on Intel and break on M-series Macs or Graviton.

May 12, 2026

Harrison Guo

Kernel Debug Field Notes Performance Analysis

Status: Companion blog draft. Anchors on Obsidian note [[fence]] — the r1==0 && r2==0 impossibility proof.

Companion assets

Original video: Store→Load Reordering EXPLAINED: x86 vs ARM64 Real-World Test!
GitHub: harrison001/CoreTracer — includes memory-model experiments

TL;DR

x86 and ARM64 do not agree on what your code means under concurrent execution.

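x86 (TSO model): each core's stores become visible to other cores in program order; the only reordering permitted is a later load passing an earlier store to a different address (store→load), a side effect of the store buffer.
ARM64 (weakly ordered): loads and stores to different addresses can be reordered in any combination unless they are ordered by a dependency, an acquire/release instruction (LDAR, STLR), or an explicit barrier (DMB, DSB).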

The same lock-free C program that runs millions of iterations correctly on Intel can deadlock, or produce results that are impossible on x86, after about a minute on ARM64, with no change to the source.

The setup

A classic memory-model test:

// Initially x == 0 and y == 0.
// Thread 1                 // Thread 2
x = 1;                      y = 1;
r1 = y;                     r2 = x;

On a system with sequential consistency, r1 == 0 && r2 == 0 is impossible: whichever store comes first in the interleaving precedes both loads, so at least one load must return 1. On real hardware:

x86: r1==0 && r2==0 happens, but rarely (the only allowed reordering is store→load)
ARM64: r1==0 && r2==0 happens frequently
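
One way to count these outcomes is a small C11 harness along the lines of the sketch below. It is not the harness from the video: the function name count_sb_hits, the go/finished handshake, and the 64-byte alignment are assumptions made for illustration. The test accesses use relaxed atomics, so neither the compiler nor the CPU is asked to keep the store ahead of the load, and the program stays free of data races in the C sense.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* The two shared locations, kept on separate (assumed 64-byte) cache lines. */
static _Alignas(64) atomic_int x, y;
/* Round handshake between coordinator and workers; not part of the test. */
static _Alignas(64) atomic_int go, finished;
static int r1, r2;   /* each thread's observation for the current round */
static int rounds;

/* Spin until the coordinator opens round i, so both threads start together. */
static void wait_for_round(int i) {
    while (atomic_load_explicit(&go, memory_order_acquire) != i) { }
}

static void *thread1(void *arg) {
    (void)arg;
    for (int i = 1; i <= rounds; i++) {
        wait_for_round(i);
        atomic_store_explicit(&x, 1, memory_order_relaxed);   /* x = 1;  */
        r1 = atomic_load_explicit(&y, memory_order_relaxed);  /* r1 = y; */
        atomic_fetch_add_explicit(&finished, 1, memory_order_release);
    }
    return NULL;
}

static void *thread2(void *arg) {
    (void)arg;
    for (int i = 1; i <= rounds; i++) {
        wait_for_round(i);
        atomic_store_explicit(&y, 1, memory_order_relaxed);   /* y = 1;  */
        r2 = atomic_load_explicit(&x, memory_order_relaxed);  /* r2 = x; */
        atomic_fetch_add_explicit(&finished, 1, memory_order_release);
    }
    return NULL;
}

/* Runs n rounds and returns how many ended with r1 == 0 && r2 == 0. */
long count_sb_hits(int n) {
    pthread_t t1, t2;
    long hits = 0;

    rounds = n;
    atomic_store(&go, 0);
    atomic_store(&finished, 0);

    /* Sketch-level error handling: give up if a thread cannot be created. */
    int rc1 = pthread_create(&t1, NULL, thread1, NULL);
    int rc2 = pthread_create(&t2, NULL, thread2, NULL);
    assert(rc1 == 0 && rc2 == 0);
    (void)rc1; (void)rc2;

    for (int i = 1; i <= n; i++) {
        atomic_store_explicit(&x, 0, memory_order_relaxed);
        atomic_store_explicit(&y, 0, memory_order_relaxed);
        /* Release: the resets above are visible before the round opens. */
        atomic_store_explicit(&go, i, memory_order_release);
        /* Acquire: both workers have published r1 and r2 for round i. */
        while (atomic_load_explicit(&finished, memory_order_acquire) != 2 * i) { }
        if (r1 == 0 && r2 == 0) hits++;
    }

    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return hits;
}
```

Built with something like cc -O2 -std=c11 -pthread, a main function can call count_sb_hits(100000000) and print the result. The count depends heavily on the machine, core placement, and load, so no expected number is given here.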

Debug command transcript

# TODO: paste actual test harness + perf invocation from the video
# Run a tight loop of the above test, count occurrences where r1==0 && r2==0
# Compare counts on x86 vs ARM64 (or Apple Silicon)
# ./reorder_test -arch=x86_64
# ./reorder_test -arch=arm64

What the data shows

(Placeholder — paste actual numbers from video.)

Platform	Iterations	r1==0 && r2==0 hits
x86_64 (Intel)	100M	~50 (very rare)
ARM64 (Apple Silicon / Graviton)	100M	~hundreds of thousands (common)

Why the fence ([[fence]] in Obsidian) makes r1==0 && r2==0 actually impossible

Insert an mfence (x86) or DMB ISH (ARM64) between the store and the load on each thread, and the case becomes provably impossible: the fence keeps the later load from being satisfied until the thread's earlier store is visible to the other cores.
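
In portable C11 the same idea can be written with atomic_thread_fence(memory_order_seq_cst) instead of inline assembly. The sketch below is an illustration, not the code from the video; count_fenced_hits is a hypothetical name, and it starts fresh threads every round to keep the example short, which is slow but simple. Compilers typically lower this fence to a full barrier (mfence or a locked instruction on x86, dmb ish on AArch64), though the exact instruction depends on the compiler.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static atomic_int x, y;   /* shared locations from the litmus test */
static int r1, r2;        /* read by the caller only after pthread_join */

static void *thread1(void *arg) {
    (void)arg;
    atomic_store_explicit(&x, 1, memory_order_relaxed);    /* x = 1;  */
    atomic_thread_fence(memory_order_seq_cst);             /* full barrier */
    r1 = atomic_load_explicit(&y, memory_order_relaxed);   /* r1 = y; */
    return NULL;
}

static void *thread2(void *arg) {
    (void)arg;
    atomic_store_explicit(&y, 1, memory_order_relaxed);    /* y = 1;  */
    atomic_thread_fence(memory_order_seq_cst);             /* full barrier */
    r2 = atomic_load_explicit(&x, memory_order_relaxed);   /* r2 = x; */
    return NULL;
}

/* Runs n rounds with fences in place; returns the number of rounds
   that ended with r1 == 0 && r2 == 0. */
long count_fenced_hits(int n) {
    long hits = 0;
    for (int i = 0; i < n; i++) {
        pthread_t t1, t2;
        atomic_store(&x, 0);
        atomic_store(&y, 0);

        /* Sketch-level error handling: give up if a thread cannot start. */
        int rc1 = pthread_create(&t1, NULL, thread1, NULL);
        int rc2 = pthread_create(&t2, NULL, thread2, NULL);
        assert(rc1 == 0 && rc2 == 0);
        (void)rc1; (void)rc2;

        /* Joining both threads makes their writes to r1 and r2 visible here. */
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        if (r1 == 0 && r2 == 0) hits++;
    }
    return hits;
}
```

With both fences present, the C11 memory model requires at least one of the loads to observe the other thread's store, so count_fenced_hits should return 0 on any conforming implementation, x86 or ARM64.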

The Obsidian field note covers the formal argument. Under sequential consistency, the four operations can always be arranged in a single total order that respects each thread's program order. If r1 == 0, then thread 1's load of y precedes thread 2's store y = 1 in that order. Thread 1's store x = 1 precedes its own load, and thread 2's load r2 = x follows its own store, so x = 1 precedes r2 = x and r2 must be 1, which contradicts r2 == 0.
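
The same conclusion can be checked by brute force, since there are only six ways to interleave two operations from each thread while keeping program order. The sketch below is an illustrative check rather than part of the field note; count_sc_zero_zero is a hypothetical name. It simulates every interleaving on ordinary variables and counts how many end with r1 == 0 && r2 == 0.

```c
#include <assert.h>

/* Enumerates every sequentially consistent interleaving of
 *   Thread 1: x = 1; r1 = y;      Thread 2: y = 1; r2 = x;
 * Returns how many end with r1 == 0 && r2 == 0 and stores the number of
 * interleavings explored in *total. */
int count_sc_zero_zero(int *total) {
    int zero_zero = 0;
    *total = 0;

    /* Bit s of mask set means slot s of the total order runs a thread 1 op. */
    for (unsigned mask = 0; mask < 16; mask++) {
        int bits = 0;
        for (int s = 0; s < 4; s++) bits += (mask >> s) & 1;
        if (bits != 2) continue;   /* each thread owns exactly two slots */

        int x = 0, y = 0, r1 = -1, r2 = -1;
        int pc1 = 0, pc2 = 0;      /* next operation index for each thread */

        for (int s = 0; s < 4; s++) {
            if ((mask >> s) & 1) {
                if (pc1++ == 0) x = 1; else r1 = y;   /* thread 1, in order */
            } else {
                if (pc2++ == 0) y = 1; else r2 = x;   /* thread 2, in order */
            }
        }
        assert(pc1 == 2 && pc2 == 2);   /* both threads ran to completion */

        (*total)++;
        if (r1 == 0 && r2 == 0) zero_zero++;
    }
    return zero_zero;
}
```

The function explores 6 interleavings and returns 0: two of them give one load the value 0 (thread 1 running entirely first, or thread 2 running entirely first), and the remaining four give r1 == 1 && r2 == 1.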

What this teaches backend / AI infra engineers

You don’t write inline memory barriers in production Go. But:

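Atomics in Go and Rust map to specific hardware instructions. On x86, atomic loads and release stores are ordinary MOVs, so they are often “free” (the CPU already does most of what you need); only sequentially consistent stores and read-modify-write operations need a locked instruction or a fence. On ARM64, the same operations emit acquire/release instructions (LDAR, STLR) or explicit barriers, and those cost cycles.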
Lock-free queues that test fine on x86-based CI can deadlock on ARM-based prod. AWS Graviton, Apple Silicon, and GCP Tau T2A all expose this. Real production incidents trace to “the same code works on the dev’s Intel MacBook and breaks on Graviton.”
AI infra running on multi-arch fleets: inference servers increasingly run on ARM (Graviton, NVIDIA Grace). Code written assuming x86’s memory model has latent bugs that surface only under specific scheduling.

The lesson: portability between x86 and ARM is not “recompile and run.” The compiler honors the source-language memory model (Go, Rust, C++11+); the bugs that surface on ARM are bugs that the source language always permitted but x86’s stronger model accidentally hid.
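
A minimal sketch of what "honoring the source-language memory model" looks like in C11 is the flag-based handoff below. The names handoff, payload, and ready, and the value 42, are made up for illustration. The release store and acquire load are what make the handoff correct on every architecture; if both were weakened to memory_order_relaxed, the code would contain a data race, yet x86 hardware would tend to hide it because it keeps stores in order and loads in order, while ARM64 would be free to let the consumer see ready == 1 with a stale payload.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static int payload;        /* plain data, published through `ready` */
static atomic_int ready;   /* 0 = not yet published, 1 = published */

static void *producer(void *arg) {
    (void)arg;
    payload = 42;
    /* Release: the write to payload is visible to anyone who later
       observes ready == 1 with an acquire load. */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

/* Starts a producer, waits for the flag, and returns the payload it saw. */
int handoff(void) {
    pthread_t t;

    payload = 0;
    atomic_store(&ready, 0);

    /* Sketch-level error handling: give up if the thread cannot start. */
    int rc = pthread_create(&t, NULL, producer, NULL);
    assert(rc == 0);
    (void)rc;

    /* Acquire: pairs with the release store in producer(). */
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0) { }
    int seen = payload;

    pthread_join(t, NULL);
    return seen;
}
```

On x86 both the release store and the acquire load typically compile to plain MOV instructions, while AArch64 compilers typically emit a store-release (stlr) and a load-acquire (ldar or ldapr), which is the cost difference described in the list above.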

Video: Cache Miss, TLB Miss & False Sharing
Obsidian: [[fence]] — fence formal argument and proof
GitHub: CoreTracer

Tags: memory-model x86 arm64 concurrency lock-free reordering tso
