๐ŸงฉLive coding: the SPSC ring buffer

The exercise at the heart of the first interview, narrated end to end: clarify, design, the power-of-two mask, head/tail ownership, acquire/release (and what breaks on Arm), cache-line padding, batching, testing, and benchmarking.

Before you start coding an SPSC ring buffer in C, what do you ask me?

I would pause and turn the exercise into a contract before writing code.

The questions I would ask are:

  • What is stored: bytes, pointers, fixed-size events, or generic void * items?
  • Is it exactly single producer and single consumer, each on its own thread?
  • Should push/pop be non-blocking, or should they spin/block?
  • What should happen when the ring is full: fail, overwrite, or wait?
  • Is capacity fixed at init time, and can we require a power-of-two size?
  • Do we need full capacity, or is leaving one slot empty acceptable?
  • Are we targeting portable C11 atomics, Linux kernel style, or a specific platform?

For a live interview I would propose the simplest useful contract: fixed capacity, one producer thread, one consumer thread, non-blocking push and pop, return false on full/empty, power-of-two capacity, and leave one slot empty to disambiguate full from empty.

What they're listening for: This is where Mohamed should show discipline. Do not jump straight into code. Clarify ownership, blocking behavior, overflow policy, capacity, and atomics. It sounds like a NIC datapath engineer because the queue is a contract, not just an array.
Likely follow-ups
  • What if I say overwriting old data is acceptable?
  • What if the producer is an interrupt handler?
  • Why does fixed capacity matter in a datapath?
Define the data structure and API you would start with.

I would start with a narrow API so the synchronization rules are easy to reason about. For the live exercise I would store pointers, because packet/event rings often pass ownership of buffers rather than copying payloads.

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct spsc_ring {
    size_t cap;
    size_t mask;
    _Atomic size_t head;   /* producer writes */
    _Atomic size_t tail;   /* consumer writes */
    void **slots;
};

bool ring_init(struct spsc_ring *r, void **storage, size_t cap);
bool ring_push(struct spsc_ring *r, void *item);
bool ring_pop(struct spsc_ring *r, void **item_out);

I would say out loud that head is the next slot the producer will write, and tail is the next slot the consumer will read. The array index is idx & mask, so cap must be a power of two. This version leaves one slot empty, so usable capacity is cap - 1.

If they prefer storing objects instead of pointers, I would make the element type explicit rather than using void *. For a first implementation, clear ownership beats a clever generic interface.

What they're listening for: Strong signal: state who writes each index. That sets up the lock-free reasoning later. Do not hide behind a generic macro library unless asked.
Likely follow-ups
  • Why store pointers rather than copying packet bytes?
  • Who owns `storage` and how long must it live?
  • What invariant relates `head`, `tail`, and occupancy?
Write the naive first version, before worrying about atomics or performance.

I would first write the single-threaded version to make the ring mechanics visible. Then I would upgrade the ordering.

static bool is_power_of_two(size_t x)
{
    return x != 0 && (x & (x - 1)) == 0;
}

bool ring_init(struct spsc_ring *r, void **storage, size_t cap)
{
    if (!r || !storage || !is_power_of_two(cap) || cap < 2)
        return false;

    r->cap = cap;
    r->mask = cap - 1;
    atomic_init(&r->head, 0);
    atomic_init(&r->tail, 0);
    r->slots = storage;
    return true;
}

bool ring_push_naive(struct spsc_ring *r, void *item)
{
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t next = (head + 1u) & r->mask;

    if (next == tail)
        return false;

    r->slots[head] = item;
    atomic_store_explicit(&r->head, next, memory_order_relaxed);
    return true;
}

bool ring_pop_naive(struct spsc_ring *r, void **item_out)
{
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);

    if (tail == head)
        return false;

    *item_out = r->slots[tail];
    atomic_store_explicit(&r->tail, (tail + 1u) & r->mask, memory_order_relaxed);
    return true;
}

I would call this naive deliberately. It gets the circular buffer logic right, but the all-relaxed version is not yet portable for two threads because the index update can become visible before the slot contents are visible.

What they're listening for: This is a good live-coding move: get a correct mechanical skeleton first, then explain why it is insufficient. Mohamed should not pretend the first version is done.
Likely follow-ups
  • What test would catch the ring wraparound?
  • Why is `cap < 2` invalid with one empty slot?
  • What exactly is wrong with all-relaxed in two threads?
Why require capacity to be a power of two, and why use a bitmask instead of modulo?

If cap is a power of two, wraparound can use idx & (cap - 1). For example with capacity 1024, idx & 1023 maps a monotonically increasing index into the array.

The reasons are:

  • & mask is cheaper and predictable in a hot path.
  • It avoids a division-like % operation when the compiler cannot prove the divisor is constant.
  • It matches how many descriptor rings are sized in low-level datapaths.

I would not overstate it. If the ring is not hot, % cap is simpler and may be compiled well. But in a live SPSC exercise for a NIC-style role, I would choose power-of-two capacity because it is conventional and keeps the index math cheap.

One detail: I prefer monotonically increasing head and tail counters, then mask only when indexing. That makes occupancy head - tail natural. The earlier naive code wrapped the counters themselves, which is simpler to show but less flexible for full-capacity variants.

What they're listening for: They are listening for practical performance reasoning, not folklore. Say mask is useful in hot paths, but do not claim modulo is always terrible.
Likely follow-ups
  • What happens if `cap` is not a power of two and you still use `& mask`?
  • Why might monotonic counters be nicer than wrapped counters?
  • How large can the counters grow before wraparound matters?
Now write the SPSC push/pop with correct C11 acquire/release ordering.

I would switch to monotonic indices. Only the producer writes head; only the consumer writes tail. Each side may read the other side's index.

bool ring_push(struct spsc_ring *r, void *item)
{
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head - tail == r->cap - 1)
        return false;

    r->slots[head & r->mask] = item;
    atomic_store_explicit(&r->head, head + 1u, memory_order_release);
    return true;
}

bool ring_pop(struct spsc_ring *r, void **item_out)
{
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);

    if (tail == head)
        return false;

    *item_out = r->slots[tail & r->mask];
    atomic_store_explicit(&r->tail, tail + 1u, memory_order_release);
    return true;
}

The producer writes the slot, then publishes the new head with release. The consumer reads head with acquire before reading the slot. That pair prevents the consumer from seeing the new index but stale slot contents.

The reverse direction is for space: the consumer finishes reading the slot, then publishes tail with release. The producer reads tail with acquire before deciding the slot is free to reuse.

What they're listening for: This is the core of the exercise. The key words are: own index relaxed load, other index acquire load, publish with release, slot write before head publish, slot read before tail publish.
Likely follow-ups
  • Why is the producer's load of `head` relaxed?
  • Why does the producer acquire-load `tail`?
  • Can the slot array itself be non-atomic?
Why does SPSC not need a lock, and what would make that statement false?

It does not need a lock because there is no contention on writes to the same index.

The ownership split is:

  • Producer is the only writer of head.
  • Consumer is the only writer of tail.
  • Producer writes a slot only before publishing head.
  • Consumer reads a slot only after observing published head.
  • Producer reuses a slot only after observing published tail.

So the atomics are not protecting a shared read-modify-write. They are providing visibility and ordering between two ownership domains.

The statement becomes false if there are two producers, two consumers, a shared count, a resize operation while the ring is active, or any side touching slots outside the ownership protocol. For MPSC or MPMC, two threads can try to claim the same slot or update the same index, so you need a stronger reservation mechanism such as compare-exchange, per-slot sequence numbers, or a lock.

What they're listening for: This should sound like Mohamed's MCU-DSP boundary story: lock-free because ownership is simple, not because atomics are magic.
Likely follow-ups
  • Why is a shared `count` unattractive here?
  • What if the producer wants to peek at the consumer's item?
  • How does this map to a NIC descriptor ring?
Be precise: what can reorder if we use only relaxed atomics, and what breaks on ARM or another weakly ordered CPU?

With only relaxed atomics, the atomic operations are indivisible, but there is no ordering relationship between the slot access and the index publish.

The producer code logically says:

r->slots[head & r->mask] = item;
atomic_store_explicit(&r->head, head + 1u, memory_order_relaxed);

But on a weak memory model, another core may observe the head store before the slot store is visible to it. Then the consumer sees head != tail, reads the slot, and gets an old pointer or uninitialized value.

The consumer side has the mirror issue. If it publishes tail before its slot read is effectively complete, the producer may reuse that slot while the consumer is still reading it.

On x86 TSO, normal stores are observed in store order by other cores, so the first bug is often hidden in practice. But I should not write code whose correctness depends on x86 behavior if the target could be ARM, or if I want portable C11 reasoning. Acquire/release states the actual contract: data first, then publish; observe publish, then consume data.

volatile would not fix this. It only constrains some compiler optimizations on the volatile access. It does not create the cross-core happens-before relationship that the ring needs.

What they're listening for: This is where many candidates hand-wave. Mohamed should name the exact bad outcome: consumer sees the new head before the slot contents. Then mention x86 TSO may mask it, ARM can expose it, and `volatile` is not synchronization.
Likely follow-ups
  • Why does x86 often hide the bug?
  • What is the acquire load preventing on the consumer side?
  • How is this similar to descriptor write before doorbell?
Can we make any of the operations relaxed without breaking correctness?

Yes, but only where the ordering is not carrying ownership between threads.

The load of my own index can be relaxed because only I write it:

  • Producer loads head relaxed.
  • Consumer loads tail relaxed.

The publish stores should be release:

  • Producer release-stores head after writing the slot.
  • Consumer release-stores tail after reading the slot.

The load of the other side's index is where I use acquire:

  • Consumer acquire-loads head before reading a produced slot.
  • Producer acquire-loads tail before reusing a consumed slot.

Some implementations weaken the producer's load of tail if slot lifetime is otherwise safe, but in an interview I would keep acquire there because it cleanly expresses that tail means the consumer is finished with the slot. I would rather be correct and explain the possible optimization than remove ordering prematurely.

What they're listening for: A good answer separates atomicity from ordering. It also shows judgement: optimize after proving the contract, not by weakening memory orders casually.
Likely follow-ups
  • Is `memory_order_seq_cst` wrong here?
  • What does acquire/release buy compared with a full fence?
  • Would the answer differ for trivially copyable integers instead of pointers?
What cache-line issue do you expect with `head` and `tail`, and how would you avoid it?

head and tail are written by different cores. If they sit on the same cache line, the cache line bounces between producer and consumer even though they are not logically sharing a writable variable. That is false sharing.

I would pad or align the hot fields so producer-owned and consumer-owned state are on separate cache lines.

#define CACHELINE 64

struct spsc_ring_padded {
    size_t cap;
    size_t mask;
    void **slots;

    _Alignas(CACHELINE) _Atomic size_t head;
    char pad1[CACHELINE - sizeof(_Atomic size_t)];

    _Alignas(CACHELINE) _Atomic size_t tail;
    char pad2[CACHELINE - sizeof(_Atomic size_t)];
};

In production I would be careful with portability around cache-line size and padding expressions, but the intent is simple: keep producer-written state and consumer-written state from invalidating each other's cache lines.

I would also avoid putting hot indices next to cold debug counters that another core updates. A small layout mistake can turn an otherwise good lock-free ring into a cache-coherency benchmark.

What they're listening for: They want more than algorithmic correctness. False sharing matters for a low-latency NIC team, and it connects directly to per-queue state layout.
Likely follow-ups
  • What tool or benchmark symptom would suggest false sharing?
  • Why is false sharing possible even without a data race?
  • Which fields belong together on the producer cache line?
How would you batch this ring to amortize overhead?

The single-item API is easy to reason about, but it pays atomic loads and stores per item. In a datapath, I would add batch operations so the producer reserves or checks a run of free slots, writes several items, then publishes head once. The consumer similarly reads head once, drains several items, then publishes tail once.

Conceptually:

size_t ring_push_many(struct spsc_ring *r, void **items, size_t n)
{
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    size_t used = head - tail;
    size_t free_slots = (r->cap - 1u) - used;
    size_t count = n < free_slots ? n : free_slots;

    for (size_t i = 0; i < count; i++)
        r->slots[(head + i) & r->mask] = items[i];

    atomic_store_explicit(&r->head, head + count, memory_order_release);
    return count;
}

The tradeoff is latency versus throughput. Batching reduces atomic/cache traffic and branch overhead, but if the producer waits too long to publish a batch, the first item in the batch waits longer. For a live answer I would say: expose both single-item and batch APIs, measure tail latency, and tune batch size to the workload.

What they're listening for: This is a NIC-datapath answer: batching is how you make line-rate software plausible, but it has a tail-latency cost.
Likely follow-ups
  • How does batching affect p99 latency?
  • What batch size would you start with?
  • How is this similar to NAPI polling or TX completion cleanup?
How would you test the ring before running any thread stress test?

I would start with deterministic single-thread tests because most ring bugs are plain arithmetic and boundary bugs before they are memory-ordering bugs.

The first tests:

  • Init rejects null pointers, non-power-of-two capacity, and capacity less than 2.
  • Empty ring: pop returns false.
  • Fill until usable capacity cap - 1; the next push returns false.
  • Pop all items and verify exact FIFO order.
  • Wraparound: push/pop repeatedly across several multiples of cap.
  • Alternating push/pop leaves the ring usable and ordered.
  • Invariant checks: occupancy is never greater than cap - 1; every pushed item is popped once.

I would also test small capacities, especially cap == 2, because leaving one slot empty means it can hold only one item. Tiny capacities expose off-by-one mistakes quickly.

Only after that passes would I move to two-thread stress, because a stress failure is much harder to diagnose if the basic wrap logic is already wrong.

What they're listening for: This answer should be boring and strong. Single-thread first is senior discipline: isolate deterministic correctness before blaming memory ordering.
Likely follow-ups
  • Why is capacity 2 an important edge case?
  • What assertion would you put inside `push` or `pop` in debug builds?
  • How would you test pointer ownership?
Now how would you test the concurrent version?

I would run a two-thread producer/consumer stress where the producer pushes a monotonically increasing sequence number and the consumer verifies exact order with no duplicates and no gaps.

The stress should vary:

  • Ring capacity: tiny, medium, and large.
  • Producer/consumer speed: add deliberate pauses or yields on either side.
  • Batch size if batching exists.
  • CPU affinity: same core, sibling cores, different cores if possible.
  • Runtime: short deterministic tests in CI and longer randomized stress locally.

I would compile and run with ThreadSanitizer when possible. Correct C11 atomics should avoid data races on head and tail; the slot accesses are non-atomic but should be ordered by the acquire/release protocol. If TSan reports races on slots, I would inspect whether the tool understands the synchronization; if not, I would still use the report as a prompt to verify the happens-before edges.

I would add counters for failed pushes and failed pops, but not treat them as correctness failures unless the test expects no backpressure. In a non-blocking ring, full and empty are normal states.

What they're listening for: The important invariant is sequence order. Mohamed should also show he knows stress is probabilistic and TSan is useful but not a substitute for reasoning.
Likely follow-ups
  • What does a duplicate sequence number imply?
  • What does a gap imply?
  • Why can a stress test pass on x86 and still be wrong on ARM?
How would you benchmark latency for this ring?

I would benchmark after correctness, and I would define the latency measurement carefully. For a ring, useful measurements are enqueue-to-dequeue latency and operation cost under different producer/consumer rates.

For a low-level benchmark I might use rdtsc or rdtscp on x86, but I would say the caveats out loud: pin threads, use invariant TSC, serialize enough for the measurement, avoid migrating between cores with unsynchronized counters, and subtract measurement overhead where possible.

I would include:

  • Warmup before recording, so cold caches and page faults do not dominate.
  • Preallocated memory; no allocation in the measured path.
  • CPU affinity and fixed frequency settings if the environment allows it.
  • Percentiles: p50, p90, p99, p999, not only mean.
  • Separate throughput and latency modes, because batching can improve one and hurt the other.

I would also use a histogram rather than just printing averages. For this team, tail latency is the interesting part: a mean can look fine while rare cache misses, scheduler events, or false sharing create spikes.

What they're listening for: They want measurement maturity. Mention warmup, pinning, percentiles, and `rdtsc` caveats. Do not claim one number means the design is fast.
Likely follow-ups
  • Why is the mean misleading for latency?
  • How can the benchmark itself perturb the result?
  • What would you compare against?
Curveball: how do you distinguish full from empty, and what if I demand all slots be usable?

The simple design leaves one slot empty:

  • Empty: head == tail.
  • Full: head - tail == cap - 1.
  • Usable capacity: cap - 1.

That is the version I would choose first because the invariants are clean.

If all slots must be usable, I have options:

  • Use monotonic counters and define full as head - tail == cap, empty as head == tail.
  • Use a separate count, but that is unattractive in SPSC because both sides may need to update it.
  • Use per-slot sequence numbers, which scales better to more complex queues but adds metadata and logic.

With monotonic counters, masking is only for indexing: slots[idx & mask]. The counters themselves keep enough history to tell full from empty. I would still be careful about unsigned wraparound and choose a wide type like size_t or uint64_t.

For the interview I would say: I started with one empty slot for clarity. If product requirements demand full capacity, monotonic counters are the next clean step.

What they're listening for: This curveball checks whether Mohamed knows the classic ambiguity and can offer variants without panicking.
Likely follow-ups
  • Why is a shared `count` worse for concurrency?
  • What counter width would you use?
  • How would per-slot sequence numbers help?
Curveball: can this ring be resized while producer and consumer are running?

Not safely in the simple design. The slot array, mask, and capacity are part of the synchronization contract. If one thread changes them while the other is indexing, the other thread can read from freed storage, use the wrong mask, or lose items.

The practical answers are:

  • Do not resize; choose capacity from the workload and expose full/drop counters.
  • Stop both producer and consumer, drain or quiesce the ring, allocate a new ring, then restart.
  • Add a much more complex indirection/versioning scheme, but that is no longer the simple SPSC exercise.

For a NIC-style datapath, I would usually prefer fixed-size rings. Dynamic resize in the hot path is a source of latency spikes and ownership bugs. If bursts overflow the ring, I would first ask whether the capacity, batching, backpressure, or consumer scheduling is wrong before adding live resize.

What they're listening for: This is a judgement test. The right answer is not an elaborate resize algorithm unless explicitly required. Fixed rings are normal in datapaths.
Likely follow-ups
  • How would you safely quiesce the ring?
  • What counters would tell you the ring is too small?
  • Why can a larger ring hurt latency?
Curveball: what changes for MPMC?

MPMC changes the problem fundamentally. In SPSC, each index has one writer. In MPMC, multiple producers may try to claim the same head, and multiple consumers may try to claim the same tail. A simple load/store index update is no longer enough.

I would need a reservation protocol, usually based on compare-exchange or fetch-add, and often per-slot sequence numbers so each slot carries its own state. The producer claims a sequence, waits until that slot is free, writes the item, then publishes that slot. The consumer claims a sequence, waits until that slot is ready, reads it, then marks it free for a future wrap.

The costs are higher:

  • More atomic read-modify-write operations.
  • More cache-line contention on shared indices.
  • More complicated full/empty and fairness behavior.
  • Harder testing and more subtle ABA/wraparound issues.

So I would not casually generalize the SPSC code. If the requirement is truly MPMC, I would either use a proven queue implementation or design it explicitly around per-slot sequence numbers. For the AMD exercise, the important point is knowing exactly which simplification SPSC bought us: single writer per index.

What they're listening for: This final curveball checks humility. SPSC code is not a small edit away from correct MPMC. Say that clearly.
Likely follow-ups
  • Why is compare-exchange needed?
  • What does a per-slot sequence number represent?
  • When would a lock be the pragmatic answer?