🧠 CPU Microarchitecture for Low Latency

Probes whether the candidate can connect CPU ordering, caches, NUMA, and MMIO behavior to real NIC fast-path correctness and latency.

[Interactive figure: MESI state walkthrough for two cores. Initial state: the cache line is Invalid in both Core A and Core B; the only up-to-date copy is in main memory.]

MESI invariant: at most one core may hold the line in M or E. A write needs exclusive ownership, so it invalidates every other copy first.
You have a single-producer/single-consumer ring between two cores. The producer writes descriptors and then advances `prod`. What memory ordering is required on x86, Arm, and RISC-V, and why is `volatile` the wrong answer? [staff]

The invariant is: descriptor contents must become visible before the producer publishes the index, and the consumer must not read descriptors before observing the published index.

On x86 write-back memory, TSO already orders normal stores before later normal stores, so the producer usually needs a compiler barrier or C11 release store, not an mfence, for CPU-to-CPU publication. The consumer uses acquire when reading prod so the compiler and weak ISAs do not move descriptor loads before the index load.

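On Arm, use stlr/release store or dmb ishst before the index store, and ldar/acquire load or dmb ishld after the index load on the consumer. An stlr followed in program order by an ldar is also kept in order (the pair is sequentially consistent), which the store-only and load-only dmb variants do not provide. Multi-copy atomicity, by contrast, applies to all normal accesses in the revised Armv8 model; it is not something the pair adds. On RISC-V, use acquire/release atomics where available, or fences such as fence rw,w before the publishing store and fence r,rw after the index load, depending on the mapping.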

volatile only constrains some compiler optimizations for that object. It does not create an inter-thread happens-before relation, does not emit the right hardware barriers on weakly ordered CPUs, and does not make a torn or racing non-atomic index safe. On x86 it accidentally looks correct because TSO does the ordering for you; ship the same code to Arm and it races.

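#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder descriptor layout for illustration; the real one is device-specific. */
typedef struct { uint64_t addr; uint32_t len; uint32_t flags; } desc_t;

struct ring {
    desc_t d[1024];
    _Atomic uint32_t prod;   /* written only by the producer */
    _Atomic uint32_t cons;   /* written only by the consumer (unused in this excerpt) */
};

/* Producer: fill the slot with plain stores, then publish with a release store.
 * The caller must already know that slot i is free. */
void produce(struct ring *r, uint32_t i, desc_t x) {
    r->d[i & 1023] = x;
    atomic_store_explicit(&r->prod, i + 1, memory_order_release);
}

/* Consumer: acquire-load prod, then read the slot. *i is the consumer's cursor. */
bool consume(struct ring *r, uint32_t *i, desc_t *out) {
    uint32_t p = atomic_load_explicit(&r->prod, memory_order_acquire);
    if (*i == p) return false;   /* nothing published yet */
    *out = r->d[*i & 1023];
    (*i)++;
    return true;
}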
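
The explicit-fence mapping described for Arm and RISC-V can also be written with standalone C11 fences. The following is a minimal sketch, not the author's implementation: `slot_t`, `struct fring`, and the function names are hypothetical. Compilers typically lower the release fence to `dmb ish` on AArch64 and `fence rw,w` on RISC-V, the acquire fence to `dmb ishld` and `fence r,rw`, and emit no instruction for either on x86.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SLOTS 1024u   /* power of two, matching the ring above */

/* Hypothetical payload type for illustration. */
typedef struct { uint64_t addr; uint32_t len; } slot_t;

struct fring {
    slot_t d[RING_SLOTS];
    _Atomic uint32_t prod;
};

/* Producer: plain descriptor stores, release fence, relaxed index store. */
static void fring_produce(struct fring *r, uint32_t i, slot_t x) {
    r->d[i & (RING_SLOTS - 1)] = x;
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&r->prod, i + 1, memory_order_relaxed);
}

/* Consumer: relaxed index load, acquire fence, then descriptor loads. */
static bool fring_consume(struct fring *r, uint32_t *i, slot_t *out) {
    uint32_t p = atomic_load_explicit(&r->prod, memory_order_relaxed);
    if (*i == p) return false;                 /* ring is empty */
    atomic_thread_fence(memory_order_acquire);
    *out = r->d[*i & (RING_SLOTS - 1)];
    (*i)++;
    return true;
}
```

The fence form is mostly useful when one fence can cover several index loads or stores; for a single index, the release store and acquire load are usually the cheaper encoding on Arm.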
What they're listening for: A strong answer separates compiler ordering, CPU ordering, cache coherence, and atomicity. The trap is over-fencing on x86 while under-fencing on Arm/RISC-V.
Follow-ups
  • How does the answer change when descriptors are read by a PCIe device via DMA?
  • What breaks if the index and a hot statistic share one cache line?
  • When would `memory_order_relaxed` still be correct?
A polling loop reads a NIC completion queue in cacheable host memory and occasionally sees a completion before its packet metadata appears valid. What ordering and ownership bugs would you investigate? [senior]

First distinguish CPU-owned memory from device-owned DMA memory. For device writes into coherent host memory, the driver must not consume the rest of a completion until the device-owned valid bit, generation bit, or phase bit says the entry is complete. Then it needs the DMA read barrier used by that environment: in the Linux kernel this is typically dma_rmb() before reading the remaining fields after the ownership bit.

Common bugs are reading fields before checking the generation bit, using smp_rmb() (which can reduce to a compiler barrier on uniprocessor builds) or no barrier at all where dma_rmb() is needed, missing endian conversion, reusing the descriptor before the device has stopped owning it, or allowing the compiler to cache a CQ field across polls.

I would also verify that the CQ memory is mapped with the intended attributes, not write-combining or uncached by accident, and that the device's completion format really writes the owner/status field last. PCIe keeps posted writes from one requester in order unless they carry the relaxed-ordering (RO) attribute, so a well-designed device writes payload before the owner bit. That guarantee breaks if the device sets RO on completion writes, or if the owner field does not sit in the last TLP it writes for the entry. If the hardware writes status first, software cannot fix it with a CPU barrier; the protocol is wrong.
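
The check-owner-then-read discipline can be sketched in user-space C. This is a hedged model, not a real driver: the `cqe` layout, the queue size, and `fake_device_complete()` are hypothetical, and the C11 acquire fence only stands in for the place where a Linux driver would call `dma_rmb()`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define CQ_ENTRIES 8u   /* power of two; illustrative size */

/* Hypothetical completion entry: the device writes `status` last. */
struct cqe {
    uint32_t len;
    uint32_t cookie;
    _Atomic uint32_t status;   /* bit 0 is the phase bit */
};

struct cq {
    struct cqe e[CQ_ENTRIES];
    uint32_t head;    /* software consumer index */
    uint32_t phase;   /* phase value that marks a fresh entry */
};

static void cq_init(struct cq *q) {
    for (uint32_t i = 0; i < CQ_ENTRIES; i++) {
        q->e[i].len = 0;
        q->e[i].cookie = 0;
        atomic_init(&q->e[i].status, 0);
    }
    q->head = 0;
    q->phase = 1;     /* first pass expects phase 1; entries start at 0 */
}

/* Returns true and fills the outputs only when the entry is software-owned. */
static bool cq_poll(struct cq *q, uint32_t *len, uint32_t *cookie) {
    struct cqe *c = &q->e[q->head & (CQ_ENTRIES - 1)];
    uint32_t st = atomic_load_explicit(&c->status, memory_order_relaxed);
    if ((st & 1u) != q->phase)
        return false;                            /* still device-owned */
    atomic_thread_fence(memory_order_acquire);   /* stand-in for dma_rmb() */
    *len = c->len;                               /* read fields only now */
    *cookie = c->cookie;
    q->head++;
    if ((q->head & (CQ_ENTRIES - 1)) == 0)
        q->phase ^= 1u;                          /* expected phase flips on wrap */
    return true;
}

/* Test-only model of the device: payload first, owner/phase field last. */
static void fake_device_complete(struct cq *q, uint32_t idx, uint32_t phase,
                                 uint32_t len, uint32_t cookie) {
    struct cqe *c = &q->e[idx & (CQ_ENTRIES - 1)];
    c->len = len;
    c->cookie = cookie;
    atomic_store_explicit(&c->status, phase & 1u, memory_order_release);
}
```

Because the expected phase flips on every wrap, software never has to write to a consumed entry to mark it stale, which keeps the CQ lines read-only from the CPU side.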

What they're listening for: This checks whether they know barriers order the observer, not magic visibility, and whether they understand DMA ownership protocols.
Follow-ups
  • Why is a phase bit preferable to clearing entries in a hot CQ?
  • Where would you put prefetches in the poll loop?
  • What evidence would distinguish hardware write ordering from a software race?
Explain how false sharing shows up in an ultra-low-latency network path, and give concrete layout rules for ring indices, statistics, and per-queue state. [senior]

False sharing is coherence traffic caused by independent variables occupying the same cache line. A 64-byte line is common on x86 server CPUs, so one core updating rx_prod can invalidate another core's read-mostly rx_cons or statistics even though there is no logical sharing. Intel's L2 spatial (adjacent-line) prefetcher also pulls in the other half of a 128-byte aligned pair, and other parts may behave similarly, so aligning the producer and consumer halves to separate 128-byte regions is sometimes safer than 64.

Rules I would use:

  • Put producer-owned and consumer-owned indices on separate cache lines.
  • Keep per-packet hot fields away from slow-path counters, timestamps, and debug flags.
  • Use per-core or per-queue counters and aggregate out of band.
  • Pad arrays of queue state so queue 0 and queue 1 do not share a line.
  • Verify object placement with pahole, perf c2c, and cache-line address offsets, not by inspection alone.
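#include <stdalign.h>
#include <stdatomic.h>
#include <stdint.h>

/* In C, alignas goes on the members, not the struct tag; the member
 * alignment also pads each index out to a full 64-byte line. */
struct ring {
    alignas(64) _Atomic uint32_t prod;   /* written by producer core */
    alignas(64) _Atomic uint32_t cons;   /* written by consumer core */
};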

The tradeoff is footprint. Padding a few hot queue structures is cheap; padding millions of descriptors can destroy cache residency and TLB locality, so the win flips to a loss.
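
The layout rules above can be enforced at compile time rather than checked by eye. The sketch below is illustrative: `struct queue_state`, its fields, and `CACHE_LINE` are hypothetical names, and 64 bytes is an assumption that should become 128 where adjacent-line prefetch matters.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumption: 64-byte coherence granule */

struct queue_state {
    alignas(CACHE_LINE) _Atomic uint32_t prod;   /* producer-owned line */
    alignas(CACHE_LINE) _Atomic uint32_t cons;   /* consumer-owned line */
    alignas(CACHE_LINE) uint64_t drops;          /* slow-path counters share a third line */
    uint64_t resets;
};

/* Cache-line index of a field offset within the structure. */
static inline size_t line_of(size_t offset) { return offset / CACHE_LINE; }

static_assert(offsetof(struct queue_state, prod) / CACHE_LINE !=
              offsetof(struct queue_state, cons) / CACHE_LINE,
              "prod and cons must not share a cache line");
static_assert(offsetof(struct queue_state, cons) / CACHE_LINE !=
              offsetof(struct queue_state, drops) / CACHE_LINE,
              "hot indices must not share a line with slow-path counters");
static_assert(sizeof(struct queue_state) % CACHE_LINE == 0,
              "adjacent array elements must not share a cache line");
```

A build that violates the layout then fails immediately; pahole and perf c2c remain the tools for confirming what actually happens at run time.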

What they're listening for: A senior engineer should describe ownership and write frequency, not just say 'add padding everywhere'.
Follow-ups
  • How would `perf c2c` present this problem?
  • When can true sharing be cheaper than duplicating state?
  • What changes on SMT siblings versus separate sockets?
A doorbell register is mapped write-combining. Why can this reduce overhead, and what ordering hazards does it introduce for a NIC transmit path? [staff]

Write-combining lets the CPU merge adjacent or repeated stores in a WC buffer and avoid waiting for each MMIO write to drain as a separate transaction. That helps a doorbell-heavy transmit path, and it is the mechanism behind write-combined TX descriptor pushes where the whole descriptor is streamed in one burst.

The hazard is that WC stores are weakly ordered compared with normal write-back memory and may sit in a WC buffer that flushes on its own schedule. Before ringing the doorbell, descriptor writes in host memory must be visible to the device; in the Linux kernel that is dma_wmb() before the MMIO write. If the doorbell itself is WC, an sfence (or the architecture-specific MMIO accessor semantics) may be needed to push the WC buffer out so the doorbell reaches the device before software assumes it has.

Use the platform accessors (writel() and writeq(), which in current kernels already carry the ordering that older code spelled out with wmb() or mmiowb()) rather than open-coded pointer stores unless the mapping and ordering contract is deliberately controlled. Also avoid read-after-write flushes in the hot path unless required: reading back a device register to flush a posted PCIe write turns a fire-and-forget write into a non-posted round trip across the link, which can add hundreds of nanoseconds to microseconds.
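
The descriptor-then-doorbell order can be sketched as a user-space model. Everything here is an assumption for illustration: `struct txq`, `tx_desc`, and the function names are hypothetical, the doorbell is an ordinary `volatile` word rather than a mapped BAR, and the C11 fence only marks where a Linux driver would place `dma_wmb()` before `writel()`. A C11 fence does not by itself order real MMIO.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define TXQ_SLOTS 256u   /* power of two; illustrative size */

/* Hypothetical descriptor layout. */
struct tx_desc { uint64_t addr; uint32_t len; uint32_t flags; };

struct txq {
    struct tx_desc ring[TXQ_SLOTS];
    uint32_t tail;                 /* next slot software will fill */
    volatile uint32_t *doorbell;   /* in a driver: an ioremap'd register */
    uint32_t doorbells_rung;       /* instrumentation for this example */
};

/* Fill one descriptor in host memory; does not notify the device. */
static void txq_post(struct txq *q, uint64_t addr, uint32_t len) {
    struct tx_desc *d = &q->ring[q->tail & (TXQ_SLOTS - 1)];
    d->addr = addr;
    d->len = len;
    d->flags = 1u;                 /* hypothetical "valid" flag */
    q->tail++;
}

/* Publish everything posted so far with a single doorbell write. */
static void txq_kick(struct txq *q) {
    atomic_thread_fence(memory_order_release);   /* stand-in for dma_wmb() */
    *q->doorbell = q->tail;                      /* stand-in for writel() */
    q->doorbells_rung++;
}
```

Posting several descriptors per `txq_kick()` amortizes the MMIO write, at the price of delaying the first packet in the batch until the kick happens.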

What they're listening for: The nuance is that write-combining is a throughput optimization that can hurt determinism and correctness if treated like ordinary cacheable memory.
Follow-ups
  • What is the difference between posted MMIO writes and cached stores?
  • When would batching doorbells increase p99 latency?
  • Why might a read from the same BAR be used after a reset write but not per packet?
How do store buffers and store-to-load forwarding affect a tight packet-processing loop, and what symptoms suggest a forwarding failure? [senior]

A store buffer lets the core retire stores before they reach L1/coherence, reducing stalls. Later loads can forward from the store buffer if address, size, alignment, and overlap rules are friendly. If forwarding fails, the load stalls until the store commits to L1 and the load can re-issue, typically on the order of ten-plus cycles versus a few-cycle forward, a small but very visible bubble in a nanosecond-scale loop.

The classic failure is a narrow store followed by a wider, overlapping load (write four header bytes, then read an eight-byte word covering them), or a misaligned load that straddles the store. x86 generally cannot forward when a load needs bytes from more than one store, or when the load is wider than and not contained in a single store. Symptoms: high cycles with no obvious cache misses, sensitivity to alignment, and improvement after reordering fields or using full-width aligned stores. perf exposes ld_blocks.store_forward on Intel for exactly this.

In packet code, avoid write-then-read of overlapping header bytes in the fast path. Build the value in a register, write once at natural alignment, and avoid packed unaligned accesses unless the wire format forces them at the boundary.
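
The two access patterns can be put side by side. This is a sketch under assumptions: `struct hdr_fields` is a hypothetical 8-byte header, and an optimizing compiler may already merge the narrow stores in the first function, so any timing difference has to be confirmed in the generated code and with counters.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 8-byte header with no padding. */
struct hdr_fields { uint16_t type; uint16_t flags; uint32_t len; };
static_assert(sizeof(struct hdr_fields) == 8, "header must be exactly 8 bytes");

/* Forwarding-hostile: three narrow stores, then one 8-byte load that
 * needs bytes from all three of them. */
static uint64_t hdr_write_then_read(uint8_t *hdr, uint16_t type,
                                    uint16_t flags, uint32_t len) {
    memcpy(hdr + 0, &type, sizeof type);
    memcpy(hdr + 2, &flags, sizeof flags);
    memcpy(hdr + 4, &len, sizeof len);
    uint64_t word;
    memcpy(&word, hdr, sizeof word);
    return word;
}

/* Forwarding-friendly: assemble the value in a local (typically a
 * register), store it once at full width, and never reload it. */
static uint64_t hdr_build_then_write(uint8_t *hdr, uint16_t type,
                                     uint16_t flags, uint32_t len) {
    struct hdr_fields f = { type, flags, len };
    uint64_t word;
    memcpy(&word, &f, sizeof word);
    memcpy(hdr, &word, sizeof word);
    return word;
}
```

Both functions leave identical bytes in the header; only the store and load shapes seen by the pipeline differ.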

What they're listening for: This separates cache miss thinking from pipeline forwarding hazards, which are easy to miss in low-latency code.
Follow-ups
  • Why can `memcpy` beat hand-written byte stores here?
  • How would you design a microbenchmark for this?
  • What compiler transformations could hide or create the issue?
A queue pair performs well when both threads run on one socket but p99 doubles when split across sockets. Walk through the NUMA and coherence causes you would test. [senior]

I would check three locality domains: CPU execution, memory allocation, and device attachment. The NIC's PCIe root complex belongs to a NUMA node; RX/TX rings, packet buffers, and the polling thread should usually be local to that node. Remote memory adds interconnect hops, and remote cache-line ownership transfer makes producer/consumer indices much more expensive.

Tests:

  • Confirm CPU and NIC topology with lstopo, numactl -H, /sys/bus/pci/devices/.../numa_node.
  • Pin poll threads and IRQs to cores local to the NIC.
  • Allocate hugepage or DMA memory on the same node with numactl --membind or driver-local allocation.
  • Split queue state so cross-socket writes are minimized.
  • Compare median, p99, and cache-to-cache/HITM events, not just throughput.

The fix is not always 'same socket'. If an application thread must run elsewhere, batching and ownership handoff may beat cache-line ping-pong. On modern AMD EPYC the same reasoning applies one level down: crossing CCX/CCD boundaries within a socket already adds Infinity-Fabric latency, so 'same socket' is necessary but not sufficient.
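
The first test in the list, confirming that device, CPU, and memory sit on the same node, can be automated at startup. A minimal sketch, assuming Linux sysfs: `read_int_file()` and `placement_is_local()` are hypothetical helpers, and the caller supplies the full `numa_node` path for its own device.

```c
#include <assert.h>
#include <stdio.h>

/* Read one decimal integer from a sysfs-style file. Returns 0 on success,
 * -1 if the file cannot be opened or does not start with an integer. */
static int read_int_file(const char *path, int *out) {
    FILE *f = fopen(path, "r");
    if (!f) return -1;
    int ok = (fscanf(f, "%d", out) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* True only when the device node is known and both the polling CPU and the
 * buffer memory are on it. A negative device node is treated as unknown,
 * so the function reports "not verified" rather than guessing. */
static int placement_is_local(int dev_node, int cpu_node, int mem_node) {
    if (dev_node < 0) return 0;
    return dev_node == cpu_node && dev_node == mem_node;
}
```

A poller could refuse to start, or at least log loudly, when the check fails, which is cheaper than discovering a remote-node ring from a p99 regression.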

Figure: Local vs remote NUMA access across the interconnect.
What they're listening for: A strong answer connects PCIe locality, allocator policy, and coherence ownership rather than treating NUMA as just DRAM latency.
Follow-ups
  • How can automatic NUMA balancing hurt a pinned low-latency process?
  • What does first-touch allocation imply for hugepage pools?
  • When would you duplicate a ring instead of sharing it cross-socket?
Where can hardware prefetch help a network datapath, where can it hurt, and when would you add software prefetch? [senior]

Hardware prefetch works well for predictable streams: descriptor rings, contiguous packet metadata, and linear scans. It works poorly for pointer-chasing, random flow-table lookups, and data-dependent next-hop structures. It can hurt by pulling unused packet payload into cache, evicting hot ring state, or consuming memory bandwidth on a core that is already latency-bound.

Software prefetch is useful when the program knows the future address earlier than hardware can infer it: for example, prefetch the next descriptor and its packet buffer header while processing the current one. The distance must match the loop body and memory latency; too near is useless, too far pollutes cache or crosses ownership before the entry is valid. On x86, prefetcht0 brings the line into all cache levels, prefetcht2 only into the outer levels (typically L3), and prefetchnta uses a non-temporal hint to limit pollution, which is often what you want for payload you will touch once.

for (i = 0; i < n; i++) {
    if (i + PREFETCH_AHEAD < n)   /* do not form an address past the array */
        __builtin_prefetch(&desc[i + PREFETCH_AHEAD], 0, 0); /* read, locality 0 (NTA) */
    process(&desc[i]);
}

I would measure with and without prefetch across packet and burst sizes. If prefetch improves average throughput but worsens p99 under incast, it may not belong in the low-latency profile.

What they're listening for: The candidate should treat prefetch as a measured scheduling hint with cache pollution risk, not a universal optimization.
Follow-ups
  • Would you prefetch for read, write, or both?
  • How do DDIO or cache-injected DMA writes change the tradeoff?
  • What perf events would you inspect?
What is x86 TSO, and what common lock-free algorithm mistake appears only when code is moved to Arm or RISC-V? [staff]

x86 TSO is a relatively strong model for normal write-back memory: loads are not reordered with other loads, stores are not reordered with other stores, and a store is not reordered with an older load. The main relaxation is that a later load may complete before an earlier store to a different address becomes visible, because the store waits in the store buffer (StoreLoad reordering, the one thing mfence or a locked operation prevents). Locked operations act as full barriers. x86 is also multi-copy-atomic: a store becomes visible to all other cores at once.

A common bug is publishing a pointer or index with a plain store after initializing an object, then having consumers read the pointer and object fields with plain loads. On x86 this often appears to work because store-store and load-load ordering are strong. On Arm or RISC-V the consumer can observe the pointer before the initialized fields unless the publish is release and the consume is acquire. Older Arm was even non-multi-copy-atomic, so a store could reach some cores before others; Armv8 was revised to be multi-copy-atomic for all normal accesses, but it still permits the load-load and store-store reordering that release/acquire exists to forbid.

Another bug is assuming MMIO and DMA memory follow the same ordering as cacheable memory. Device memory has separate rules and often needs dma_wmb(), dma_rmb(), writel()/readl(), or architecture-specific barriers.
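
The pointer-publication pattern from the common bug, written portably, looks roughly like this. A minimal sketch with hypothetical names (`struct flow`, `g_flow`, `publish_flow`, `lookup_flow`); the point is only the release store paired with the acquire load.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical object that is initialized once and then shared. */
struct flow { uint32_t id; uint32_t next_hop; };

static _Atomic(struct flow *) g_flow;   /* NULL until published */

/* Writer: initialize every field first, then publish with release so the
 * field stores cannot become visible after the pointer store. */
static void publish_flow(struct flow *f, uint32_t id, uint32_t next_hop) {
    f->id = id;
    f->next_hop = next_hop;
    atomic_store_explicit(&g_flow, f, memory_order_release);
}

/* Reader: acquire-load the pointer; field loads through the returned
 * pointer are then ordered after it on every ISA. */
static const struct flow *lookup_flow(void) {
    return atomic_load_explicit(&g_flow, memory_order_acquire);
}
```

Replacing both orderings with plain accesses tends to keep passing tests on x86, which is exactly why the bug usually surfaces only after a port.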

What they're listening for: This probes whether the candidate can reason portably without cargo-culting fences.
Follow-ups
  • Why is `memory_order_consume` rarely used in portable C?
  • What does multi-copy atomicity buy you?
  • How would you prove the bug with a litmus test?
Speculative execution and memory disambiguation make code faster. How can they mislead low-latency measurements or create correctness issues around MMIO? [senior]

Speculation can execute loads before the branch that logically guards them, and memory disambiguation can issue a load before older stores when the core predicts no alias. For normal memory, mis-speculation is rolled back architecturally, but the timing effects remain: cache lines may be touched, branch predictors trained, and measured code may look faster or noisier depending on warmup.

For MMIO, drivers must use the correct accessors and memory types so the CPU does not speculate, combine, or reorder accesses in a way the device protocol cannot tolerate. A speculative load to a device register with read side effects (a read-to-clear status, a FIFO pop) is a correctness bug, which is one reason device BARs are mapped device/uncacheable, not write-back. Never use ordinary C loads/stores to device registers unless the mapping and architecture contract explicitly allow it.

For measurements, serialize timestamp reads, pin the thread, warm caches deliberately, randomize branch cases if needed, and separate measuring the intended work from measuring predictor or cache state left by previous iterations.
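
One common way to serialize timestamp reads on x86-64 is to bracket `rdtsc`/`rdtscp` with `lfence`. The sketch below assumes GCC or Clang intrinsics from `<x86intrin.h>` and assumes `lfence` is dispatch-serializing on the target; `ts_begin()` and `ts_end()` are hypothetical names, and the non-x86 branch is only a coarse wall-clock fallback so the code builds elsewhere.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>
#if defined(__x86_64__)
#include <x86intrin.h>
#endif

#if !defined(__x86_64__)
/* Coarse fallback in nanoseconds; wall-clock time, not a cycle counter. */
static uint64_t ts_fallback(void) {
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}
#endif

/* Start of a timed region. Units: TSC ticks on x86-64, ns otherwise. */
static inline uint64_t ts_begin(void) {
#if defined(__x86_64__)
    _mm_lfence();               /* earlier instructions finish before the read */
    uint64_t t = __rdtsc();
    _mm_lfence();               /* timed work cannot start before the read */
    return t;
#else
    return ts_fallback();
#endif
}

/* End of a timed region. */
static inline uint64_t ts_end(void) {
#if defined(__x86_64__)
    unsigned int aux;
    uint64_t t = __rdtscp(&aux); /* waits for earlier instructions to execute */
    _mm_lfence();                /* later instructions cannot start early */
    return t;
#else
    return ts_fallback();
#endif
}
```

The fences themselves cost cycles, so the usual practice is to time an empty region once and subtract that overhead from each sample.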

What they're listening for: The key is understanding that architectural correctness and microarchitectural side effects are different things.
Follow-ups
  • Why can a bounds check fail to protect a cache side channel?
  • What is the cost of adding a speculation barrier in a hot path?
  • How would you keep MMIO out of a compiler's common-subexpression assumptions?
AMD EPYC uses MOESI, and Intel server caches are typically described as MESI/MESIF. What does the extra Owned (O) state change, and why does it matter for a producer/consumer NIC ring shared across cores? [staff]

In MESI a dirty line lives in Modified in exactly one cache. When another core reads it, the holder must either write the line back to memory and downgrade to Shared, or forward and write back. MESIF (Intel) adds Forward, designating one Shared copy as the responder for clean lines so you avoid multiple caches all answering, but a dirty line still has to be written back to be shared.

MOESI (AMD) adds Owned: a line can be dirty with respect to memory and still Shared across caches, with one cache in Owned holding the authoritative dirty copy and responsible for eventually writing back. This is dirty sharing: the owner can forward modified data cache-to-cache without a memory write-back on every reader.

Why it matters for a NIC ring: the hot pattern is one core writing an index/descriptor and another core reading it. Under MOESI, that line can bounce Owned-to-Shared cache-to-cache without round-tripping to DRAM, which lowers the cost of the inevitable cross-core line transfer. It does not make false sharing free, the line still ping-pongs, but it changes the cost of the transfer and what perf c2c reports as remote HITM. The practical takeaway is unchanged: separate producer and consumer state by cache line; coherence-protocol cleverness reduces the penalty but never removes it.
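
The difference on a remote read of a dirty line can be captured in a toy transition function. This is a deliberately simplified two-cache model with hypothetical names, not a description of any vendor's implementation: it ignores Forward, snoop filters, and directories, and only records whether sharing a dirty line forces a memory write-back.

```c
#include <assert.h>
#include <stdbool.h>

enum line_state { ST_I, ST_S, ST_E, ST_O, ST_M };
enum protocol   { PROTO_MESI, PROTO_MOESI };

struct xfer {
    enum line_state holder;    /* state left in the cache that had the line */
    enum line_state reader;    /* state granted to the requesting cache */
    bool memory_writeback;     /* did sharing force a write to memory? */
};

/* Another core reads a line that `holder` currently describes. */
static struct xfer remote_read(enum protocol p, enum line_state holder) {
    struct xfer r = { holder, ST_S, false };
    switch (holder) {
    case ST_M:
        if (p == PROTO_MOESI) {
            r.holder = ST_O;             /* keep the dirty data and share it */
        } else {
            r.holder = ST_S;             /* must clean the line to share it */
            r.memory_writeback = true;
        }
        break;
    case ST_O: break;                    /* stays Owned and supplies the data */
    case ST_E: r.holder = ST_S; break;   /* clean line: just downgrade */
    case ST_S: break;
    case ST_I: r.reader = ST_E; break;   /* no other copy: reader gets Exclusive */
    }
    return r;
}
```

In both protocols the consumer still takes a coherence miss on every producer update; the model only shows that MOESI can satisfy it without touching memory.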

What they're listening for: Distinguishes someone who has only memorized MESI from someone who understands why AMD's protocol changes cache-to-cache transfer cost. Bonus if they connect it to perf c2c HITM accounting.
Follow-ups
  • Why is dirty sharing a bandwidth win on a many-core socket?
  • How does the Owned state show up differently in `perf c2c` than a clean shared line?
  • Does MOESI change the ordering guarantees a programmer sees? (No, only the transport.)
Sketch the cache and interconnect hierarchy of a modern AMD EPYC part (core, CCX, CCD, IOD, sockets). Where are the latency cliffs, and how do they constrain where you pin a NIC poll thread and place its buffers? [staff]

The hierarchy, fastest to slowest: per-core L1 and L2 (private), then a shared L3 within a CCX (a cluster of cores), then a CCD (the chiplet die, holding one or two CCXs depending on generation) connected over Infinity Fabric to a central I/O die (IOD) that hosts the memory controllers and PCIe root complexes, then a second socket across the inter-socket fabric.

The latency cliffs are the boundaries. Within a CCX, cores share L3 and core-to-core is cheap. Crossing to another CCX or CCD means the line transfer and any L3 hit go over Infinity Fabric, adding tens of nanoseconds. Published consumer-part numbers show inter-CCD core-to-core latency jumping from roughly 75 ns to roughly 180 ns on some generations before microcode/AGESA fixes brought it back near 75 ns, so the exact number is part- and firmware-specific. Crossing sockets adds the most.

Constraints for a NIC poll thread:

  • Pin the poll thread and its paired application/consumer thread to cores in the same CCX so the shared L3 absorbs the ring ping-pong instead of the fabric.
  • Place RX/TX rings, descriptor pools, and packet buffers on the NUMA node whose IOD owns that NIC's PCIe root (/sys/bus/pci/devices/.../numa_node).
  • Steer the NIC's MSI-X vectors to the same local cores via IRQ affinity, so interrupt handling and any platform cache injection (DDIO/DCA-style) touch the same L3 as the poll thread.
  • Treat 'same socket' as necessary but not sufficient: the CCX/CCD boundary is a real cliff inside one socket.
lstopo --of console        # see CCX/CCD/L3 grouping
cat /sys/bus/pci/devices/0000:c1:00.0/numa_node    # NUMA node of the NIC (example address)
numactl --cpunodebind=2 --membind=2 ./poller       # 2 is an example; use the node printed above
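
Whether two cores sit behind the same L3 can be checked from the kernel's cache topology files. On Linux, `/sys/devices/system/cpu/cpuN/cache/index3/shared_cpu_list` usually describes the L3 (the `level` file next to it confirms which level an index is) and holds a list such as `0-7,64-71`. The parser below is a sketch with a hypothetical name, `cpulist_contains()`; it handles only the plain range-and-comma form.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* True if `cpu` appears in a kernel cpulist string such as "0-7,64-71".
 * Parsing stops at the first character that is not part of the list
 * (for example the trailing newline sysfs appends). */
static bool cpulist_contains(const char *list, int cpu) {
    const char *p = list;
    while (*p) {
        char *end;
        long lo = strtol(p, &end, 10);
        if (end == p) return false;            /* malformed or empty field */
        long hi = lo;
        p = end;
        if (*p == '-') {                       /* range "lo-hi" */
            hi = strtol(p + 1, &end, 10);
            if (end == p + 1) return false;
            p = end;
        }
        if (cpu >= lo && cpu <= hi) return true;
        if (*p == ',') p++;                    /* next field */
        else break;                            /* end of list */
    }
    return false;
}
```

Feeding it the `shared_cpu_list` of the poll thread's core and the CPU number of the consumer thread answers whether the ring ping-pong stays inside one L3.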
What they're listening for: Staff-level AMD-platform awareness: they should name the CCX/CCD/IOD layers, know the cliff is intra-socket, and tie pinning + buffer placement + IRQ affinity together rather than just saying 'use NUMA'.
Follow-ups
  • Why can two cores in different CCXs but the same socket still suffer ping-pong?
  • How does the IOD location of PCIe roots affect which cores you choose?
  • What changes with a single-CCD vs multi-CCD SKU for a latency-sensitive app?
Inbound DMA can land in LLC via DDIO/DCA instead of DRAM. Explain the latency win, the failure mode at high bandwidth, and how you would tune it for a latency-sensitive receiver. [staff]

Direct cache injection (Intel DDIO, and analogous data-direct paths on other vendors) writes inbound packet data straight into a portion of the last-level cache rather than DRAM. The win is that the receiving core's first touch of the descriptor and headers is an LLC hit instead of a DRAM miss, cutting tens to ~100 ns off the critical path and avoiding a DRAM write plus read for data the CPU consumes immediately.

The failure mode is cache pollution under load. Injection is restricted to a limited set of LLC ways so I/O cannot evict the entire cache, but at multi-hundred-gigabit rates the NIC fills those ways faster than software drains them. Lines get evicted to DRAM before the core reads them (DDIO leak / LLC thrash), so you pay a DRAM round trip anyway plus contention. Studies show DDIO can cut I/O-bound latency by up to ~30% at 100G but increase tail latency by ~30% at 200G, exactly the regime where the injection ways overflow.

Tuning for a latency-sensitive receiver:

  • Keep working set small: post few RX buffers, drain promptly with busy-polling, and process in small bursts so injected lines are consumed before eviction.
  • Pin RX queues to cores whose LLC the data is injected into (correct NUMA/IOD locality and IRQ affinity).
  • Where the platform exposes it, bound or partition the I/O injection ways (Intel CAT / I/O way control) so a bursty bulk queue does not evict a latency queue's lines.
  • Measure LLC hit rate on first touch and remote-DRAM events, not just throughput; if injected lines are missing, either reduce in-flight buffers or accept bypass-to-DRAM for the bulk queue.
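
The "keep the working set small" rule is easy to sanity-check with arithmetic: compare the bytes the NIC may have in flight against the LLC capacity reachable through the injection ways. The helpers and every number in the usage note are illustrative assumptions, not figures for any particular CPU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bytes of LLC that inbound DMA may allocate into, given the way counts. */
static uint64_t injection_capacity(uint64_t llc_bytes, unsigned total_ways,
                                   unsigned io_ways) {
    return llc_bytes / total_ways * io_ways;
}

/* Bytes the NIC can have in flight if every posted RX buffer is filled. */
static uint64_t rx_footprint(unsigned queues, unsigned descs_per_queue,
                             unsigned buf_bytes) {
    return (uint64_t)queues * descs_per_queue * buf_bytes;
}

/* True when the worst-case posted footprint fits in the injection ways. */
static bool fits_injection_ways(uint64_t footprint, uint64_t capacity) {
    return footprint <= capacity;
}
```

For example, with an assumed 33 MiB LLC of 11 ways and 2 I/O ways, the injection capacity is 6 MiB. Eight queues of 1024 descriptors with 2 KiB buffers can hold 16 MiB in flight and overflow it; cutting each queue to 256 descriptors brings the footprint to 4 MiB, which fits.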
What they're listening for: They should know cache injection is a finite resource that inverts from a win to a tail-latency liability at high bandwidth, and propose working-set and way-partitioning controls rather than 'turn DDIO on'.
Follow-ups
  • Why does posting fewer RX descriptors sometimes lower latency here?
  • How would CAT/way-partitioning isolate a latency queue from a bulk queue?
  • What PMU events tell you injected lines are being evicted before use?
What does a `lock`-prefixed atomic (or `lock xadd` on a shared counter) actually cost on x86, and why can a single shared atomic statistic dominate a fast path that otherwise has no cache misses? [senior]

A lock-prefixed op on a write-back line is an atomic read-modify-write: the core must hold the line in a writable (Modified/Exclusive) state for the duration, and the operation has full-barrier semantics (it drains the store buffer and acts as a StoreLoad fence). Modern x86 implements this by locking the cache line, not the bus, when the access fits in one line; a misaligned op straddling two lines or crossing a page can escalate to a far more expensive split-lock / bus lock.

The cost on an uncontended, already-owned line is modest, low tens of cycles, dominated by the fence and the RMW. The killer is contention. If two cores both lock xadd the same counter, the line must transfer to a writable state on each op, so it ping-pongs cache-to-cache. Each transfer is a coherence miss (tens of ns cross-CCX/socket), and the ops serialize because only one core can own the line writable at a time. A loop that is otherwise all L1 hits suddenly stalls on one line, and it scales negatively: more cores make it worse.

The fix in a datapath is to never share a hot counter. Use per-core or per-queue counters in separately-cache-lined storage, increment them with plain (or relaxed-atomic) ops, and aggregate out of band in a slow path. That turns a contended RMW into a private L1 write.
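
A per-queue counter layout along those lines might look like the sketch below. The names (`padded_counter`, `rx_packets`, `count_rx`, `total_rx`) and sizes are hypothetical. Note that a relaxed `atomic_fetch_add` still compiles to a `lock`-prefixed instruction on x86, so the fast path here uses a relaxed load plus a relaxed store, which is only valid because each slot has exactly one writer.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumption: 64-byte lines */
#define MAX_QUEUES 8    /* illustrative */

/* One counter per cache line so queues never share a line. */
struct padded_counter {
    alignas(CACHE_LINE) _Atomic uint64_t v;
};
static_assert(sizeof(struct padded_counter) == CACHE_LINE,
              "each counter must occupy exactly one cache line");

static struct padded_counter rx_packets[MAX_QUEUES];

/* Fast path: called only by the core that owns queue q (single writer),
 * so a relaxed load and store suffice and no locked RMW is issued. */
static inline void count_rx(unsigned q, uint64_t n) {
    uint64_t cur = atomic_load_explicit(&rx_packets[q].v, memory_order_relaxed);
    atomic_store_explicit(&rx_packets[q].v, cur + n, memory_order_relaxed);
}

/* Slow path: sum all slots. The total is approximate while writers run. */
static uint64_t total_rx(void) {
    uint64_t sum = 0;
    for (unsigned q = 0; q < MAX_QUEUES; q++)
        sum += atomic_load_explicit(&rx_packets[q].v, memory_order_relaxed);
    return sum;
}
```

The atomic type is kept so the slow-path reader never sees a torn value; the ordering stays relaxed because statistics do not publish any other data.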

What they're listening for: Senior+ if they separate the uncontended fence cost from the contended ping-pong, mention split locks, and reach for per-core counters instead of a shared atomic.
Follow-ups
  • Why is `lock xadd` on a private line cheap but on a shared line catastrophic?
  • What is a split lock and why do data centers alarm on it?
  • When is a single shared atomic acceptable in a fast path?