๐Ÿ”’C11/C++11 Memory Model and Lock-Free Concurrency

Probes whether a senior engineer can connect language-level atomics to hardware ordering, progress guarantees, reclamation, and NIC datapath races.

Run the producer/consumer race: relaxed gives a stale read; release/acquire fixes it.

Producer
data = 42;
ready = 1;   // plain / relaxed
Consumer
while (!ready)   // plain / relaxed
   ;
use(data);
Global order in which writes actually become visible:
Press run to watch the two threads race.
Design a single-producer/single-consumer descriptor ring using C11 atomics. Which operations need relaxed, acquire, or release ordering, and why?staff

For SPSC, the data slots are plain memory; only the indices are atomic. The producer owns prod, the consumer owns cons; each side reads the other's index to detect full/empty.

Producer:

uint32_t c = atomic_load_explicit(&cons, memory_order_acquire);
if ((p - c) < CAP) {                 /* space available */
    ring[p & mask] = desc;           /* plain store */
    atomic_store_explicit(&prod, p + 1, memory_order_release);
}

Consumer:

uint32_t p = atomic_load_explicit(&prod, memory_order_acquire);
if (c != p) {                        /* not empty */
    struct desc d = ring[c & mask];  /* plain load */
    atomic_store_explicit(&cons, c + 1, memory_order_release);
}

The edges that matter: the producer's release store of prod publishes the descriptor write; the consumer's acquire load of prod guarantees it sees that descriptor *after* observing the advanced index. Symmetrically, the consumer's release store of cons publishes that it finished reading the slot, and the producer's acquire load of cons ensures it does not overwrite a slot the consumer is still reading. Each side keeps its *own* index in a private (relaxed or even non-atomic local) variable and only publishes with release.

seq_cst is unnecessary here: there is no multi-variable global-order requirement, and it would only add fence cost. The bug to avoid is using relaxed at the handoff - that makes the index update atomic but does not publish the slot contents, so the consumer can read a half-written descriptor.

What they're listening for: Strong answers identify the publication edge (descriptor before index) *and* the reuse edge (consumer release of `cons` so the producer does not clobber an in-flight slot). The nuance is not overusing `seq_cst` while never using `relaxed` at a synchronization point.
Follow-ups
  • What changes for MPSC or MPMC, and why does the simple index scheme break?
  • Where would you place cache-line padding between `prod` and `cons`?
  • How does this differ for MMIO doorbells versus shared memory?
Explain `happens-before` and `synchronizes-with` using a packet buffer pointer published between threads.senior

If thread A initializes a buffer then does a release store of its pointer, and thread B does an acquire load of that pointer which reads the value A stored, the release/acquire pair synchronizes-with. That establishes a happens-before edge from all of A's writes sequenced before the store to all of B's reads sequenced after the load.

Without that edge, B may observe the pointer but not the initialized contents in the language model, even where the bug rarely reproduces on hardware. memory_order_relaxed would make the pointer access atomic but publish nothing about the pointee.

pkt->len = len;
pkt->flags = flags;
atomic_store_explicit(&ready, pkt, memory_order_release);

struct pkt *p = atomic_load_explicit(&ready, memory_order_acquire);
if (p)
    consume(p->len, p->flags);   /* sees len/flags written before the release */

The acquire must read a value from the release store's *release sequence* (the store itself or a later RMW chain on it). An acquire load on an *unrelated* atomic does not magically make A's writes visible - synchronizes-with is per-object, between a specific release and the acquire that observes it.

What they're listening for: Checks whether the candidate can state the formal relationship and tie it to object publication. The trap is treating atomics as global cache flushes rather than per-object release/acquire pairs.
Follow-ups
  • What is a release sequence, and how does an intervening RMW extend it?
  • Does an acquire fence after a relaxed load create the same edge?
  • What if the pointer is published before the object is fully initialized?
When are fences the right tool instead of acquire/release operations on the atomic itself?senior

Prefer ordering on the atomic operation when you can, because it binds the ordering to the object carrying the synchronization, which is easier to audit. Fences earn their place when one barrier must order *several* relaxed operations, when the synchronization variable is only accessible via relaxed ops, or when matching a hardware/DMA protocol.

A release fence before relaxed stores publishes prior writes - but only if a matching acquire side establishes synchronization through an atomic that reads the stored value. A standalone fence with no participating atomic communicates with no one.

write_descs(batch);                    /* many plain stores */
atomic_thread_fence(memory_order_release);
atomic_store_explicit(&prod, new_prod, memory_order_relaxed);

The consumer pairs this with a relaxed load of prod followed by atomic_thread_fence(memory_order_acquire) before reading descriptors - or, more simply, a single acquire load of prod. Fence-fence and fence-atomic synchronization are subtler than operation-level acquire/release, so I reach for the fence form mainly to amortize one barrier over a batch.

In drivers, also separate C fences from device/DMA barriers: atomic_thread_fence constrains the language memory model and inter-CPU ordering, but is not the right primitive for ordering writes to a PCIe MMIO doorbell, where dma_wmb()/wmb() semantics are needed.

What they're listening for: Strong answers know fences are not global barriers and require a participating atomic to synchronize. Staff nuance: separating compiler/language fences, CPU cache-coherent fences, and device ordering.
Follow-ups
  • What is fence-to-atomic synchronization, precisely?
  • Why might Linux `dma_wmb()` be needed where a C `atomic_thread_fence` is insufficient?
  • How would you audit fence placement in a batching path?
A lock-free free-list uses CAS on the head pointer. Explain the ABA problem and compare tagged pointers, hazard pointers, and epoch/RCU-style reclamation.staff

ABA: a thread reads head A, stalls; another thread pops A (and maybe more), then pushes the same address A back. The stalled CAS sees head still equals A and succeeds, even though the logical state changed underneath it. Address equality masked a generation change - and if A->next was repurposed, the CAS installs a stale next and corrupts the list. This is distinct from an ordinary CAS *failure* (where the value simply changed and you retry).

Mitigations:

  • Tagged/generation pointers: pack a counter with the pointer so CAS compares (ptr, gen). Needs spare pointer bits or a double-width CAS (cmpxchg16b on x86-64, CASP/LL-SC pair on Arm). The counter can still wrap, so size it against the worst-case churn between a thread's read and its CAS.
  • Hazard pointers: each reader publishes the node addresses it may dereference; a reclaimer frees a node only when it appears in no hazard slot. Solves use-after-free and, by deferring reuse, also defeats ABA, at the cost of a per-access store plus a scan of all hazard slots before freeing.
  • Epoch/RCU-style: defer freeing until all readers present at retirement have left their read-side critical section (a grace period). Readers are nearly free, ideal for read-mostly structures, but a stalled reader can pin memory indefinitely (unbounded retention).

In a NIC datapath I would avoid a general lock-free free-list unless contention truly demands it; per-core caches with batched cross-core hand-off usually beat a clever global CAS structure on both latency and tail behavior.

CAS retries under contention until it wins.
CAS retries under contention until it wins.
head pointer
head = A
stack (top โ†’ bottom)
Aโ†’ B
Bโ†’ C
Cโ†’ โˆ…
free list
โˆ… none
โ—‹ T1 reads head โ†’ A, plans CAS(head, A, A.next = B). Then T1 STALLS.
โ—‹ T2 pops A (head โ† B), frees A.
โ—‹ T2 pops B (head โ† C), frees B.
โ—‹ T2 reuses freed A, pushes it back (head โ† A, A.next = C).
โ—‹ T1 wakes and runs CAS(head, A, B) โ€” the saved 'next'.
phase 0/5 ยท raw CAS
headโ†’X (node)โ†’tail
reader 1pre-existingholds โ†’ H
t = 0 ยท RCU ยท linked
What they're listening for: A strong answer separates ABA from ordinary CAS retry, notes the corrupted-`next` mechanism, and treats reclamation as a correctness problem with latency/retention tradeoffs - not cleanup.
Follow-ups
  • How many tag bits are realistically available on x86-64 canonical / Arm TBI pointers, and why is it ABI-dependent?
  • Can hazard pointers alone prevent *logical* ABA, or only memory reuse?
  • How would you test ABA deterministically (interleaving injection, delays)?
Write and critique a CAS loop for incrementing a shared counter. What ordering, weak-vs-strong, and progress issues matter?senior

Textbook loop:

uint64_t old = atomic_load_explicit(&ctr, memory_order_relaxed);
uint64_t next;
do {
    next = old + 1;
} while (!atomic_compare_exchange_weak_explicit(&ctr, &old, next,
             memory_order_relaxed, memory_order_relaxed));

For a pure statistic, relaxed suffices - there is no other data being published. compare_exchange_weak can fail *spuriously* (it may report mismatch even when equal) because on LL/SC machines - Arm, RISC-V, PowerPC - the store-conditional can fail on a cache event or context switch. That is fine *inside a loop*, where weak is preferred because it avoids the compiler emitting an inner retry loop; x86 has a single lock cmpxchg and does not fault spuriously, so the choice barely matters there. Use strong when there is no surrounding loop (e.g. a one-shot try). On failure, old is updated with the current value, so the loop re-reads for free.

For synchronization rather than a bare counter, the *success* order would be release, acquire, or acq_rel depending on what you publish/consume; the failure order may not be stronger than success and may not be release/acq_rel.

Progress: lock-free means *system-wide* progress, not that any given thread finishes promptly. A hot global counter serializes the cache line and burns interconnect bandwidth, hurting tail latency under incast. In a packet path, per-CPU counters aggregated periodically are usually the right answer. Also pick fetch_add over a CAS loop for plain increment - it is wait-free and lets the hardware do an atomic add directly.

What they're listening for: Separates atomicity from synchronization, explains weak's spurious failure via LL/SC, knows the failure order constraint, and recognizes a correct lock-free counter can still be a scalability bug.
Follow-ups
  • Why is `fetch_add` preferable to a CAS loop here, and when is it not possible?
  • What failure memory order is illegal with `compare_exchange`, and why?
  • How would you measure CAS retry cost under incast-like contention?
Define lock-free, wait-free, and obstruction-free. Why do these labels matter when designing a low-latency datapath?senior

Obstruction-free: a thread makes progress if it eventually runs in isolation (no contention). Lock-free: the system as a whole makes progress - *some* thread completes in a finite number of steps regardless of others' scheduling. Wait-free: *every* participating thread completes in a bounded number of its own steps.

The labels matter because tail latency is about the unlucky thread, not aggregate throughput. A lock-free MPMC queue can starve one producer indefinitely (it keeps losing CAS races) while still being lock-free - some thread always wins. For a control path that must return RX buffers to avoid starving the receive ring, that worst case can be unacceptable, and wait-free (e.g. fetch_add-based ticketing) may be required.

They also interact with OS scheduling. A theoretically lock-free design behaves badly if it spins on a CPU that gets preempted mid-operation, false-shares with another hot variable, or piles up CAS retries during a burst. Often a bounded SPSC queue per core plus explicit backpressure gives better, more predictable latency than a general lock-free structure.

What they're listening for: Good answers give precise progress definitions and connect them to tail latency and starvation. The trap is using 'lock-free' as a synonym for 'fast'.
Follow-ups
  • Can a lock-free algorithm block on memory allocation, and what does that do to its guarantee?
  • What progress guarantee do hazard pointers usually add or subtract?
  • How would you expose backpressure without locks?
Explain RCU read-side and update-side semantics. Where would RCU fit in a NIC driver or kernel-bypass stack, and where would it be a bad fit?staff

RCU lets readers dereference a shared pointer inside a read-side critical section while updaters publish a replacement and defer reclaiming the old version until a grace period proves all pre-existing readers have left. In common kernel RCU flavors readers are nearly free (no atomic RMW, often just disabling preemption); the cost is shifted to updaters and reclamation.

Good fits are read-mostly state: flow-steering/classification tables, device feature snapshots, routing policy, statistics-object replacement. The hot path reads without a lock while updates install a new version and free the old one after a grace period.

Bad fits: high update rate (grace periods and copies dominate), objects needing *immediate* destruction, or read sections that sleep/run unbounded without the correct RCU flavor (SRCU for sleepable readers). In user space you need a real quiescent-state mechanism (e.g. liburcu QSBR/memb); an atomic pointer plus 'free it later' is not RCU and will use-after-free.

Memory ordering is part of the API: rcu_assign_pointer() carries a release-style publish, and rcu_dereference() relies on the address dependency from the loaded pointer to its target to order the dependent reads (this is the dependency ordering C tried to standardize as memory_order_consume, which all compilers currently promote to acquire and which WG14/WG21 are retiring). Replacing these with plain loads/stores - or letting the compiler break the dependency via a value-speculating optimization - is a classic bug.

What they're listening for: A strong answer treats RCU as publication + deferred reclamation, names SRCU for sleepable readers, distinguishes real QSBR from 'free later', and ties `rcu_dereference` to dependency ordering / the `consume` story.
Follow-ups
  • What exactly is a grace period, and how does QSBR detect one cheaply?
  • Why does `rcu_dereference()` exist instead of a plain load, and how can the compiler break the dependency?
  • How would you implement a user-space quiescent-state scheme?
How does a seqlock work, and why is it risky for packet metadata or shared statistics?senior

A seqlock uses a monotonically increasing sequence counter. A writer increments it to odd (signaling write-in-progress), updates the data, then increments to even. A reader samples the counter, copies the data, then re-reads the counter; if it changed or was odd, the data may be torn and the reader retries.

It fits small read-mostly snapshots where readers tolerate retries and the copied data has no lifetime hazards. It is poor for protecting *pointers to reclaimable objects* unless combined with a separate lifetime scheme, because a reader can copy a pointer the instant before a writer replaces and frees the target.

Risks:

  • Readers can livelock if writers are too frequent.
  • A reader must perform no irreversible side effect on the data before the post-read validation passes.
  • Multiword data can be observed mid-update; only the counter re-check catches it, and only if the protocol is followed exactly.
  • On weakly-ordered machines the counter loads/stores need acquire/release (or kernel read_seqbegin/read_seqretry, which place the barriers for you); naive plain accesses let the data reads float outside the counter window.

For NIC statistics seqlocks are attractive for 64-bit counters on 32-bit systems or for grouped counters read consistently, but update frequency and retry cost must be measured against per-CPU alternatives.

What they're listening for: Probes optimistic consistency versus object lifetime, and the required barriers. The trap is thinking a seqlock protects pointers the way a mutex does.
Follow-ups
  • Why must the sequence counter become odd during a write?
  • Can seqlock readers sleep, and what happens if a writer is preempted mid-update?
  • How would you avoid reader livelock under a high packet rate?
Why is double-checked locking broken without atomics, and what is the minimal correct C11/C++11 publication pattern?senior

The broken version reads a plain global pointer, takes a lock if null, constructs the object, then plainly stores the pointer. A second thread can see the non-null pointer with no happens-before edge to the constructor's writes, so it may dereference a partially-constructed object. Separately, the unsynchronized plain load racing the store is a data race - undefined behavior in C/C++ on its own.

Correct pattern: acquire on the fast-path load, release on the publishing store, mutex guarding construction:

struct obj *p = atomic_load_explicit(&global, memory_order_acquire);
if (p == NULL) {
    lock(&m);
    p = atomic_load_explicit(&global, memory_order_relaxed);
    if (p == NULL) {
        p = construct_obj();
        atomic_store_explicit(&global, p, memory_order_release);
    }
    unlock(&m);
}
return p;

The inner load can be relaxed because the mutex already orders construction among lock holders (lock/unlock provide the necessary synchronization between writers). The fast-path load must be acquire so a reader that skips the lock still observes the fully-constructed object published by the release store. In C++, std::call_once / a function-local static (guaranteed thread-safe init since C++11) is usually preferable unless a measured custom fast path is justified.

What they're listening for: Strong answers name both failures - the data race and the missing publication edge - and explain precisely why the inner load may be relaxed (mutex orders the writers) while the outer must be acquire.
Follow-ups
  • Would `volatile` fix this? Why not?
  • Why exactly can the inner load be relaxed?
  • What does `pthread_once` / `std::call_once` / a C++ local static buy you?
What is load/store tearing, and how can it appear in low-level C even on hardware that supports naturally aligned atomic loads?staff

Tearing means another thread observes only part of a multi-byte update - e.g. the low 32 bits of one write and the high 32 bits of a different write. The C memory model only grants inter-thread atomicity to atomic objects. A plain shared uint64_t read racing a write is a data race and UB, whatever the hardware usually does.

Hardware nuance still matters. A naturally aligned word-size load may be atomic on a given ISA, but unaligned, wider-than-word, packed, bitfield, or compiler-split accesses can tear; an _Atomic object that is not lock-free may be implemented with a lock. 64-bit counters on 32-bit targets are the classic tear. Even with an aligned _Atomic, the compiler is free to *split* a non-atomic struct copy into multiple loads/stores, so field-by-field copying of a shared struct is not group-atomic.

DMA and MMIO add more: device updates live outside the C thread model entirely and need the platform's accessors and barriers, not _Atomic semantics.

Use _Atomic uint64_t (or kernel atomic64_t/seqcount helpers) for shared counters. For structs, do not assume a copy is atomic as a group - use a lock, a seqlock, a versioned snapshot, or publish immutable objects by pointer. The debugging symptom is impossible-looking metadata: a length from one packet paired with flags from another, or a counter that jumps backward.

What they're listening for: Tests separating hardware-atomicity folklore from language-defined data-race freedom. Good answers cover alignment, width, packing/bitfields, non-lock-free atomics, and the fact that a struct copy is not group-atomic.
Follow-ups
  • Can a `memcpy` of a shared struct tear, and why?
  • How would you correctly read a 64-bit statistic on a 32-bit target?
  • What does `atomic_is_lock_free` actually tell you, and what does it not?
Why is `seq_cst` sometimes required where acquire/release is not? Use the store-buffer / StoreLoad reordering case.staff

Acquire/release orders a release store with the acquire that *reads it* - a producer/consumer edge on one object. It does not prevent StoreLoad reordering: a store followed by a load of a *different* location can appear reordered, because the store sits in the core's store buffer while the later load is satisfied early. Even x86's strong TSO model permits exactly this one reordering.

The canonical failure is symmetric mutual exclusion / the Dekker pattern: two threads each set their own flag then read the other's.

/* thread 0 */                 /* thread 1 */
store(&x, 1, release);         store(&y, 1, release);
r0 = load(&y, acquire);        r1 = load(&x, acquire);

With release/acquire (or even on bare TSO), both loads may read the old 0, so both threads believe they have exclusivity - mutual exclusion is violated. Release/acquire cannot forbid this because there is no synchronizes-with edge: neither load reads the value the other store produced.

seq_cst fixes it by additionally enforcing a single total order over all seq_cst operations that every thread agrees on, which forbids the StoreLoad reorder. On x86 the compiler implements a seq_cst store as xchg or store + mfence to drain the store buffer; on Arm it uses stlr/ldar plus the needed fences. The cost is that drain, so I reserve seq_cst for exactly these multi-location 'each writes then reads the other' problems and use cheaper release/acquire for ordinary producer/consumer handoff.

What they're listening for: The differentiator is naming StoreLoad reordering and the store buffer as the precise gap acquire/release leaves open, and that even x86 TSO allows it. Bonus: the `xchg`/`mfence` lowering and Dekker as the motivating case.
Follow-ups
  • Which single reordering does x86 TSO allow, and how does `mfence`/`xchg` close it?
  • Could you fix the Dekker pattern with a single seq_cst fence instead of seq_cst on every op?
  • Why is `seq_cst` not needed for a normal SPSC ring handoff?
Implement a try-pop for a Treiber (lock-free) stack and walk through every memory-safety and ordering hazard.staff

A Treiber stack pushes/pops via CAS on the head. A naive try-pop:

struct node { _Atomic(struct node *) next; int val; };
_Atomic(struct node *) head;

bool try_pop(int *out) {
    struct node *h = atomic_load_explicit(&head, memory_order_acquire);
    while (h) {
        struct node *nx = atomic_load_explicit(&h->next, memory_order_relaxed);
        if (atomic_compare_exchange_weak_explicit(&head, &h, nx,
                memory_order_acq_rel, memory_order_acquire)) {
            *out = h->val;
            reclaim(h);          /* <-- the hard part */
            return true;
        }
    }
    return false;
}

Hazards:

  • Use-after-free / dangling read: between loading h and reading h->next, another thread can pop and free h. The h->next load then touches freed memory. This is *not* solved by ordering; it needs reclamation (hazard pointer protecting h, epoch/RCU, or a tagged scheme).
  • ABA: if h is freed and the same address is pushed back, the CAS succeeds with a stale nx and corrupts the stack. Tagged pointers or a reclamation scheme that prevents reuse-while-referenced fixes this.
  • Ordering: the CAS needs at least acquire on success so the popper sees the pusher's initialization of the node; acq_rel also gives release so a subsequent pusher reusing the slot is ordered. The failure order is acquire (re-read head meaningfully). The h->next load being relaxed is acceptable only because the successful CAS provides the acquire that orders the eventual use.
  • Progress: lock-free, not wait-free - a thread can lose the CAS race arbitrarily often.

The staff conclusion: the lock-free *algorithm* is the easy 80%; safe *reclamation* is the 20% that is genuinely hard, and on a NIC datapath a per-core stack with batched migration usually beats fighting this.

What they're listening for: The signal is recognizing that the dangling `h->next` read and ABA are reclamation problems independent of memory ordering, and choosing correct success/failure orders for the CAS. Weak answers 'fix' it with stronger fences, which does nothing for use-after-free.
Follow-ups
  • Exactly where does the use-after-free window open, and how does a hazard pointer close it?
  • Why is `acq_rel` on success the right choice over `acquire`?
  • How would epoch-based reclamation change this code's fast path?