Mental Models, Demoed

1 · Memory reordering & release/acquire

The wrong mental model: “If I write the payload and then set the ready flag in source order, any thread that sees the flag must see the payload.”

Run it in relaxed mode and watch the consumer see the flag but read stale data — the two stores became visible out of order. Then flip to release/acquire and watch the bug vanish. This is the exact guarantee the SPSC ring depends on.

Producer

data = 42;
ready = 1;   // plain / relaxed

Consumer

while (!ready)   // plain / relaxed
   ;
use(data);
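A minimal sketch of the release/acquire version of the same pair, in C11 with pthreads. The `run_once` driver and the value 42 are illustrative assumptions, not part of the demo; the point is only that the release store and the acquire load pair up, so a consumer that sees the flag also sees the payload.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static int data;          /* plain payload, published through `ready` */
static atomic_int ready;  /* the flag; static storage starts at zero */

static void *producer(void *arg) {
    (void)arg;
    data = 42;                                              /* 1. write the payload */
    atomic_store_explicit(&ready, 1, memory_order_release); /* 2. publish it */
    return NULL;
}

static void *consumer(void *arg) {
    /* The acquire load pairs with the release store: once ready == 1 is
       observed, the earlier write to `data` is guaranteed to be visible. */
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;
    int seen = data;
    assert(seen == 42);   /* cannot fire with release/acquire */
    *(int *)arg = seen;
    return NULL;
}

/* Hypothetical driver: runs one producer/consumer pair and returns the
   value the consumer read. */
int run_once(void) {
    pthread_t p, c;
    int seen = 0;
    data = 0;
    atomic_store(&ready, 0);
    pthread_create(&c, NULL, consumer, &seen);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return seen;
}
```

Swap both orderings to `memory_order_relaxed` and the assert is no longer guaranteed to hold, even though it may pass on every run on a strongly ordered machine such as x86.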

2 · False sharing

The wrong mental model: “Two threads updating two different variables can't slow each other down; they never touch the same address.”

They never touch the same address — but if the two variables share a 64-byte cache line, every write rips the line away from the other core. Run both modes for the same time budget and compare the op counts.
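The usual fix is to give each variable its own line. Here is a sketch of the two layouts, assuming a 64-byte line; the struct names and the `line_of` helper are made up for illustration.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64  /* assumed line size, as in the text */

/* Both counters fit in one 64-byte line, so writes from two cores contend. */
struct counters_shared {
    uint64_t a;  /* written by core A */
    uint64_t b;  /* written by core B */
};

/* Each counter is aligned to the start of its own line. */
struct counters_padded {
    alignas(CACHE_LINE) uint64_t a;
    alignas(CACHE_LINE) uint64_t b;
};

/* Which line a member offset falls in, counted from the start of the
   struct (assumes the struct itself starts on a line boundary). */
static inline size_t line_of(size_t offset) { return offset / CACHE_LINE; }

static_assert(sizeof(struct counters_padded) == 2 * CACHE_LINE,
              "padded layout spends one full line per counter");
```

The padded layout costs 128 bytes instead of 16. That is the trade: you spend memory to stop the line from bouncing.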

[Interactive demo: variables a and b sit on one shared cache line. Counters track core A ops, core B ops, and cache-line transfers; in shared mode the line ping-pongs on every write.]

False sharing (animated): two independent variables on one 64-byte line, so the line ping-pongs between the cores' caches on every write. The work looks parallel but serializes.

3 · MMIO reads vs. posted writes

The wrong mental model: “An MMIO register access is just a normal load or store with a slightly slower address.”

A posted write is fire-and-forget; a non-posted read is a full PCIe round-trip the core must stall on. Press play and watch five writes finish while five reads are still grinding through their stalls.

Posted writes (e.g. a doorbell): fire and forget. The CPU issues each write and immediately moves on; the write drains to the device asynchronously.

Non-posted reads (an MMIO register read): the CPU stalls for the completion. Each read is a full PCIe round-trip; the core cannot retire it until the completion returns, so reads serialize and stall.

[Interactive demo: five posted writes (W0 to W4) race five non-posted reads. The 5 writes finish ~6× sooner than the 5 reads.]

The takeaway for the datapath: never poll a NIC register on the hot path. You ring a doorbell (a posted write that doesn't stall), and you learn about completions from DMA'd status / events in host memory, never from an MMIO read. One stray read can cost more than the whole rest of the send.
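The doorbell-plus-host-memory pattern can be sketched like this. Everything here is hypothetical: the `cqe` layout, the function names, and the use of C11 fences as stand-ins for the platform-specific barriers a real driver would use. No real NIC's interface is being described.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical completion entry that the device DMA-writes into host memory. */
struct cqe {
    uint32_t wr_id;  /* which request finished */
    uint32_t valid;  /* assumed to be written last by the device */
};

/* Ring the doorbell: one posted MMIO write. The CPU does not wait for it. */
static inline void ring_doorbell(volatile uint32_t *doorbell, uint32_t tail) {
    /* Stand-in for a driver write barrier: descriptor writes in host
       memory should be visible before the device sees the new tail. */
    atomic_thread_fence(memory_order_release);
    *doorbell = tail;
}

/* Poll host memory, not a device register. Returns 1 and fills *wr_id
   when a completion is present, 0 otherwise. */
static inline int poll_completion(volatile struct cqe *e, uint32_t *wr_id) {
    if (!e->valid)
        return 0;
    /* Stand-in for a driver read barrier: read the entry only after
       the valid flag has been observed. */
    atomic_thread_fence(memory_order_acquire);
    *wr_id = e->wr_id;
    e->valid = 0;  /* hand the slot back */
    return 1;
}
```

Note what is missing: nowhere on this path does the CPU load from the device. The only MMIO access is the store in `ring_doorbell`.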

4 · Cache locality

The wrong mental model: “Every array access costs about the same — cache is just a small constant factor on top of the algorithm's big-O.”

Same 64 elements, same number of accesses — touch them in order vs. at random and watch the cycle count diverge by ~6×. A miss is ~100 cycles; a hit is a handful. Locality is not a constant factor.

Each row of 8 is one 64-byte cache line. red = miss (line fetched from RAM), teal = hit (already in cache).
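To put numbers on it, here is a toy model: a 4-line LRU cache, 100 cycles per miss, 4 per hit. The cache size and hit cost are assumptions chosen for illustration (only the ~100-cycle miss comes from the text), and a column-by-column walk stands in for random order so the result is deterministic.

```c
#include <assert.h>
#include <stddef.h>

enum { ELEMS = 64, PER_LINE = 8, CACHE_LINES = 4, HIT_CYC = 4, MISS_CYC = 100 };

struct result { unsigned misses; unsigned cycles; };

/* Replay an access order against a tiny fully associative LRU cache. */
struct result simulate(const int *order, size_t n) {
    int lines[CACHE_LINES];  /* cached line ids, most recently used first */
    int used = 0;
    struct result r = {0, 0};
    for (size_t i = 0; i < n; i++) {
        assert(order[i] >= 0 && order[i] < ELEMS);
        int line = order[i] / PER_LINE;
        int pos = -1;
        for (int j = 0; j < used; j++)
            if (lines[j] == line) { pos = j; break; }
        if (pos < 0) {                      /* miss: fetch the line */
            r.misses++;
            r.cycles += MISS_CYC;
            if (used < CACHE_LINES) used++;
            pos = used - 1;                 /* new slot, or the LRU victim */
        } else {                            /* hit */
            r.cycles += HIT_CYC;
        }
        for (int j = pos; j > 0; j--)       /* move this line to the front */
            lines[j] = lines[j - 1];
        lines[0] = line;
    }
    return r;
}

/* Row by row: 0, 1, 2, ... so each line is used 8 times in a row. */
void fill_sequential(int *order) {
    for (int i = 0; i < ELEMS; i++) order[i] = i;
}

/* Column by column: every access lands on a different line than the last. */
void fill_strided(int *order) {
    for (int i = 0; i < ELEMS; i++)
        order[i] = (i % PER_LINE) * PER_LINE + i / PER_LINE;
}
```

Under these assumptions the in-order walk takes 8 misses and 1024 cycles, while the strided walk misses on all 64 accesses for 6400 cycles. Same elements, same access count, about 6× the cost.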

5 · TCP congestion control

The wrong mental model: “TCP just sends at the link rate until the receiver or app slows it down.”

TCP probes: it ramps up, overshoots, sees a loss, and halves — forever. Press play and watch the AIMD sawtooth. It's why a latency-sensitive request sharing a queue with a bulk flow suffers.

cwnd = 1 · red dot = packet loss → window halved

The shape is the point. TCP doesn't blast at line rate. It probes: exponential slow start, then linear additive increase, until a loss signals congestion and it multiplicatively decreases (halves). That sawtooth is AIMD. It also means a single low-latency request riding a fat bulk flow inherits that queue, which is why low-latency stacks pace, use small buffers, or bypass TCP entirely (UDP / RDMA) on the hot path.
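The sawtooth falls out of a few lines of arithmetic. This is a deliberately simplified sketch, not a real TCP stack: one sample per RTT, a loss whenever the window exceeds a hypothetical `capacity`, and no fast recovery or timeouts. The function name and parameters are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Record cwnd once per RTT into out[0..n-1]; returns the number of losses.
   `capacity` is a hypothetical window size above which the path drops. */
size_t simulate_aimd(unsigned capacity, unsigned ssthresh,
                     unsigned *out, size_t n) {
    assert(capacity >= 2);
    unsigned cwnd = 1;
    size_t losses = 0;
    for (size_t t = 0; t < n; t++) {
        out[t] = cwnd;
        if (cwnd > capacity) {        /* overshoot: loss signals congestion */
            losses++;
            cwnd /= 2;                /* multiplicative decrease */
            ssthresh = cwnd;
        } else if (cwnd < ssthresh) {
            cwnd *= 2;                /* slow start: double per RTT */
        } else {
            cwnd += 1;                /* additive increase: +1 per RTT */
        }
    }
    return losses;
}
```

With `capacity = 16` and `ssthresh = 8`, the window goes 1, 2, 4, 8, then climbs by one per RTT to 17, drops to 8, and repeats that climb indefinitely.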

6 · All-reduce: ring vs. tree

The wrong mental model: “All-reduce is one collective operation, so the algorithm choice is mostly an implementation detail.”

Drag the GPU count up and watch ring steps grow linearly while tree steps grow logarithmically. At cluster scale that's the difference between thousands of hops and twenty — and the collective is on every GPU's critical path.

Example at N = 8 GPUs: ring takes 2(N−1) = 14 steps; tree takes 2⌈log₂N⌉ = 6 steps.

Why the algorithm isn't “a detail.” Ring all-reduce is bandwidth-optimal (each GPU sends ~2× its data, total independent of N) but takes 2(N−1) latency hops; tree / halving-doubling takes only 2⌈log₂N⌉. At N = 1024 that's 2046 hops vs 20. On a GPU cluster the collective is on the critical path — one slow link or a bad algorithm choice stalls every GPU, which is exactly why AI fabrics obsess over tail latency and in-order delivery.
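The two step counts are easy to check for any N. A small sketch using the formulas above; the function names are illustrative.

```c
#include <assert.h>

/* Ring all-reduce: 2(N-1) steps, per the formula in the text. */
unsigned ring_steps(unsigned n) {
    assert(n >= 1);
    return 2 * (n - 1);
}

/* ceil(log2(n)) by integer doubling, avoiding floating-point rounding. */
static unsigned ceil_log2(unsigned n) {
    unsigned k = 0;
    for (unsigned long long span = 1; span < n; span *= 2)
        k++;
    return k;
}

/* Tree all-reduce: 2 * ceil(log2(N)) steps. */
unsigned tree_steps(unsigned n) {
    assert(n >= 1);
    return 2 * ceil_log2(n);
}
```

For N = 8 this gives 14 and 6; for N = 1024 it gives 2046 and 20, matching the figures above.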

Ring all-reduce (animated): chunks circulate around the ring, each GPU accumulating. It is bandwidth-optimal but takes 2(N−1) steps. Tree all-reduce trades that for 2⌈log₂N⌉ latency-optimal steps.

The habit underneath all six

Every one of these is the same move: your source code describes intent; the hardware describes reality, and they differ wherever there's a buffer, a cache, a bus, or a queue between them. Senior debugging is knowing where those gaps live. When an answer surprises you, that gap is exactly what the interviewer wants you to name.
