The technical coredescriptor ringslock-free SPSCmemory ordering

Ring buffers in the NIC datapath

On this team the ring buffer isn't academic — it's how a NIC and the host hand packets to each other. If they hand you a whiteboard, this is the core model to explain cleanly.

A NIC RX/TX ring is a ring buffer

Every NIC moves packets through descriptor rings in host memory. The mental leap to make out loud in the interview: a descriptor ring is the same producer/consumer circular-buffer pattern as the ring buffer page— except the two parties aren't two threads, they're the host CPU and the NIC hardware, communicating through shared DMA memory plus an MMIO doorbell register.

filerx_ring.c
// A NIC RX descriptor ring lives in host memory the NIC can DMA into.
// It IS a ring buffer — the two "threads" are the host CPU and the NIC.
typedef struct {
    uint64_t addr;     // device DMA address (often an IOVA, not a CPU pointer)
    uint16_t len;      // filled in by the NIC: bytes received
    uint16_t flags;    // status / ownership / end-of-packet bits
} rx_desc_t;

typedef struct {
    rx_desc_t *ring;   // 'size' descriptors, size is a power of two
    uint32_t   size;   // e.g. EF_RXQ_SIZE on Solarflare
    uint32_t   mask;   // size - 1
    uint32_t   fill;   // SW producer: next descriptor to hand the NIC
    uint32_t   reap;   // SW consumer: next completion to process
};

// 1. Driver posts empty buffers, then rings the doorbell (an MMIO write)
//    telling the NIC how many descriptors are now available.
// 2. NIC DMAs an incoming packet into ring[hw_idx].addr, records
//    length/status via descriptor state or a completion event, advances.
// 3. Driver reaps completed descriptors, hands the packet up the stack,
//    refills the slot, and rings the doorbell again.
Animated NIC RX descriptor ring: DMA fills a slot, the tail advances, a doorbell pulses
The RX descriptor ring (animated): the driver owns the tail (posts buffers, rings the doorbell), the hardware owns the head (DMAs packets in, sets the OWN/DD bit).
Host memory, PCIe, and NIC with DMA and MMIO doorbell
The whole datapath: descriptors and packet buffers live in host memory; the NIC reaches them by DMA across PCIe; the CPU pokes the NIC with an MMIO doorbell.
Animated publish sequence: write descriptor, dma_wmb barrier, ring doorbell, NIC DMAs and completes
The publish sequence (animated): write the descriptor → dma_wmb() barrier → ring the doorbell → the NIC DMAs the buffer and writes a completion. The barrier must come before the doorbell.

RX ring (receive)

The driver is the producer of empty buffers; the NIC is the consumer that fills them. The driver posts buffers and rings the doorbell; the NIC DMAs packets in, reports completion via descriptor status or events; the driver reaps and refills. If refills fall behind, the RX ring runs dry and packets get dropped (or flow control kicks in).

TX ring (transmit)

Reversed: the driver produces descriptors pointing at packets to send and rings the doorbell; the NIC consumes and transmits. With CTPIO, software pushes packet data via programmed I/O so the NIC begins cut-through transmission before the whole frame has crossed PCIe — pure latency shaving.

The producer/consumer index split is the whole trick. One side advances a producer index, the other a consumer index. The doorbell(an MMIO register write) is how software tells the NIC “I've produced up to here” — but only after a barrierthat makes the descriptor stores visible to the device first (the doorbell must not overtake the data). The NIC signals back through completion events or an ownership bit in the descriptor. Sound familiar? It's head, tail, and a publish step — in silicon, with a memory barrier in exactly the same place as the SPSC queue's release store.

Two ownership schemes — know both names

Interviewers probe whether you know how the two sides agree on who owns a descriptor. There are two common designs:

  • Head/tail registers (Intel E1000-style). The driver writes the tail (TDT/RDT) to publish descriptors; the hardware advances the head (TDH/RDH) as it consumes them. On a TX ring, TDT == TDH means empty. Full is avoided by keeping a gap or tracking next_to_use vs next_to_clean in software.
  • Descriptor ownership bit. Each descriptor carries an OWN / done / status bit. When OWN is set the DMA engine owns the descriptor; when cleared, the driver does. Many DPDK PMDs surface the same idea through RX descriptor status, often backed by a hardware DD (descriptor-done) bit.

The Solarflare / ef_vi vocabulary

Drop the right nouns and you sound like you already work there. An ef_vi virtual interface bundles an event queue plus a TX descriptor ring and RX descriptor ring, each with its own doorbell. The adapter signals completions via the event queue, not by mutating the descriptor in place. The receive ring depth is EF_RXQ_SIZE (valid values 512 / 1024 / 2048 / 4096; EF_TXQ_SIZE for TX) — a bigger RX ring absorbs larger bursts without drops, at the cost of a larger working set and the risk of hiding backpressure (worse tail latency).

And the latency flourish: TX_PUSH. When the TX descriptor ring is empty, ef_vi can ring the doorbell andwrite the packet buffer address “in one shot.” It's a sibling latency shortcut to CTPIO but a distinct mechanism: TX_PUSH accelerates the descriptor notification, while CTPIO pushes the packet data.

The walk-through they love. Be able to narrate a packet end to end: wire → NIC RX DMA → descriptor + event → NAPI poll (or ef_vi poll) → userspace buffer → TCP/IP or Onload stack. And the send: app buffer → descriptor → doorbell → DMA fetch / CTPIO / TX_PUSH → completion event.If you can talk through both directions and point at where the ring buffer, the doorbell, and the memory barriers sit, you've demonstrated exactly the mental model this team hires for.

The lock-free SPSC ring buffer

This is the exercise to have cold: “implement a lock-free queue between a producer and a consumer” (an ISR and the main loop, or two threads).

Derive it live — don't just recite the code

narrate this chain in the room

The interviewer is grading the reasoning, not the final listing. Build the data structure up one forced decision at a time, and say each “problem → therefore” out loud. This is the whole answer — the code at the end just falls out of it:

  1. I need to hand items from one thread to another. Simplest start: an array and a count.A bounded FIFO. Good enough to reason from.
  2. But dequeuing from the front of an array is O(n) — everything shifts down.Don't move the data, move indices. Keep a write cursor (head) and a read cursor (tail). Both ops become O(1).
  3. Those cursors run off the end of the array.Wrap them: i % capacity. And if I make capacity a power of two, that's i & (capacity-1) — one AND instead of a divide.
  4. When head == tail, is the buffer empty or full? Same cursor position for both — ambiguous.Let the indices run free (never mod the counters themselves, only when I index storage). Then empty is head == tail and full is head - tail == capacity. Unsigned wraparound is well-defined, so no wasted slot, no separate count.
  5. Now two threads touch these cursors. Do I need a lock?If only the producer writes head and only the consumer writes tail, there is no contended write — each side reads the other's cursor but never writes it. No mutex needed. That's the SPSC insight (Lamport, 1977).
  6. Is plain shared memory enough? On a weakly-ordered CPU the store that bumps head could become visible before the payload write — the consumer would read garbage.Publish head with a release store; observe it with an acquire load. That pair creates the happens-before edge guaranteeing the payload is visible once the new head is. Mirror it for tail. This is the entire correctness argument.
  7. It's correct now. Is it fast? head and tail sit next to each other, so they share a cache line.Two cores writing the same line ping-pong it (false sharing). Put head and tail on separate cache lines with alignas(64).

Every line below is just the residue of that reasoning — point back at the step each piece came from as you write it:

filespsc.c
#include <stdatomic.h>
#include <stdalign.h>
#include <stdint.h>
#include <stdbool.h>

#define CACHELINE 64

typedef struct {
    // Producer-owned cache line: ONLY the producer writes 'head'.
    alignas(CACHELINE) _Atomic uint32_t head;
    unsigned char pad1[CACHELINE - sizeof(_Atomic uint32_t)];
    // Consumer-owned cache line: ONLY the consumer writes 'tail'.
    alignas(CACHELINE) _Atomic uint32_t tail;
    unsigned char pad2[CACHELINE - sizeof(_Atomic uint32_t)];
    // Read-only after init.
    uint32_t mask;                       // capacity - 1, capacity is 2^n
    uint8_t *buf;                        // 'capacity' bytes
} spsc_t;
// Heap-allocate with aligned_alloc(CACHELINE, sizeof(spsc_t)) — plain
// malloc need not honour a 64-byte over-alignment in portable C.

// ---- Producer thread (e.g. the RX path / an ISR) -------------------
bool spsc_put(spsc_t *q, uint8_t v) {
    uint32_t h = atomic_load_explicit(&q->head, memory_order_relaxed); // we own head
    uint32_t t = atomic_load_explicit(&q->tail, memory_order_acquire); // observe consumer
    if (h - t > q->mask) return false;            // full when used == mask + 1 (capacity)
    q->buf[h & q->mask] = v;                      // (1) write the payload
    atomic_store_explicit(&q->head, h + 1, memory_order_release); // (2) PUBLISH
    return true;
}

// ---- Consumer thread (e.g. the application / main loop) ------------
bool spsc_get(spsc_t *q, uint8_t *out) {
    uint32_t t = atomic_load_explicit(&q->tail, memory_order_relaxed); // we own tail
    uint32_t h = atomic_load_explicit(&q->head, memory_order_acquire); // observe producer
    if (h == t) return false;                     // empty
    *out = q->buf[t & q->mask];                   // (1) read payload (safe!)
    atomic_store_explicit(&q->tail, t + 1, memory_order_release); // (2) free the slot
    return true;
}

Say these four things and you've nailed it

  • Power-of-two capacity → mask, not modulo. i & (cap-1) replaces i % cap. One AND instead of a divide — and free-running uint32_t indices wrap correctly because unsigned overflow is well-defined, as long as capacity stays far below half the counter range so head - tailcan't become ambiguous.
  • Full vs empty is unambiguous with free-running indices. Empty when head == tail; full when head - tail == capacity. No wasted slot, no count field.
  • release / acquire is the correctness crux. The producer's release store on headpublishes the payload write that came before it; the consumer's acquire load of head guarantees it sees that payload. Mirror for tail. Get this wrong and you read torn/stale data on weakly-ordered CPUs (ARM, POWER, RISC-V).
  • Cache-line padding kills false sharing. Pad head and tailonto separate cache lines so the producer and consumer cores aren't ping-ponging the same line. If the queue is heap-allocated, use aligned_alloc(64, …) — plain malloc isn't guaranteed to honour a 64-byte over-alignment in portable C.
The trap they're listening for: if you reach for volatile to make this thread-safe, that's a red flag. volatile prevents the compiler from caching/eliding an access, but it is not a memory barrier, gives no atomicity, and gives no cross-CPU ordering. Thread safety needs _Atomic + acquire/release (or a barrier). volatile is for MMIO registers and ISR-visible variables on a single core — a different job.

Name-drop the real implementations

It signals you've seen production code, not just a textbook: DPDK's rte_ring (multi-producer variant uses a CAS on the head), Linux's kfifo (power-of-two, barrier-based), and folly's ProducerConsumerQueue (the canonical C++ SPSC with readIndex/writeIndex and acquire/release). Lamport's 1983 paper is the original proof that SPSC needs no locks.

Bonus: bulk copy across the wrap

A likely follow-up: “now make it copy a whole buffer at once.” The answer is two memcpys — one to the end of storage, one for the wrapped remainder — never a per-byte branch. This is how real stacks move a packet into the ring.

filerb_write.c
#include <string.h>

// Bulk copy across the wrap point — two memcpys, never a branch per byte.
// Assumes rb->cap is a power of two (wraps with & cap-1); else use % cap.
size_t rb_write(ringbuf_t *rb, const void *src, size_t n) {
    size_t free = rb->cap - rb->count;
    if (n > free) n = free;                 // clamp to available space
    size_t first = rb->cap - rb->head;      // bytes until the end of storage
    if (first > n) first = n;
    memcpy(rb->buf + rb->head, src, first);             // head .. end
    memcpy(rb->buf, (const char *)src + first, n - first); // wrapped remainder
    rb->head = (rb->head + n) & (rb->cap - 1);
    rb->count += n;
    return n;
}

Primary sources worth reading

If you read three things before the call, read the first three. These are the authoritative references behind everything above.