Ring buffers in the NIC datapath
On this team the ring buffer isn't academic — it's how a NIC and the host hand packets to each other. If they hand you a whiteboard, this is the core model to explain cleanly.
A NIC RX/TX ring is a ring buffer
Every NIC moves packets through descriptor rings in host memory. The mental leap to make out loud in the interview: a descriptor ring is the same producer/consumer circular-buffer pattern as the ring buffer page— except the two parties aren't two threads, they're the host CPU and the NIC hardware, communicating through shared DMA memory plus an MMIO doorbell register.
// A NIC RX descriptor ring lives in host memory the NIC can DMA into.
// It IS a ring buffer — the two "threads" are the host CPU and the NIC.
typedef struct {
uint64_t addr; // device DMA address (often an IOVA, not a CPU pointer)
uint16_t len; // filled in by the NIC: bytes received
uint16_t flags; // status / ownership / end-of-packet bits
} rx_desc_t;
typedef struct {
rx_desc_t *ring; // 'size' descriptors, size is a power of two
uint32_t size; // e.g. EF_RXQ_SIZE on Solarflare
uint32_t mask; // size - 1
uint32_t fill; // SW producer: next descriptor to hand the NIC
uint32_t reap; // SW consumer: next completion to process
};
// 1. Driver posts empty buffers, then rings the doorbell (an MMIO write)
// telling the NIC how many descriptors are now available.
// 2. NIC DMAs an incoming packet into ring[hw_idx].addr, records
// length/status via descriptor state or a completion event, advances.
// 3. Driver reaps completed descriptors, hands the packet up the stack,
// refills the slot, and rings the doorbell again.
RX ring (receive)
The driver is the producer of empty buffers; the NIC is the consumer that fills them. The driver posts buffers and rings the doorbell; the NIC DMAs packets in, reports completion via descriptor status or events; the driver reaps and refills. If refills fall behind, the RX ring runs dry and packets get dropped (or flow control kicks in).
TX ring (transmit)
Reversed: the driver produces descriptors pointing at packets to send and rings the doorbell; the NIC consumes and transmits. With CTPIO, software pushes packet data via programmed I/O so the NIC begins cut-through transmission before the whole frame has crossed PCIe — pure latency shaving.
head, tail, and a publish step — in silicon, with a memory barrier in exactly the same place as the SPSC queue's release store.Two ownership schemes — know both names
Interviewers probe whether you know how the two sides agree on who owns a descriptor. There are two common designs:
- Head/tail registers (Intel E1000-style). The driver writes the tail (
TDT/RDT) to publish descriptors; the hardware advances the head (TDH/RDH) as it consumes them. On a TX ring,TDT == TDHmeans empty. Full is avoided by keeping a gap or trackingnext_to_usevsnext_to_cleanin software. - Descriptor ownership bit. Each descriptor carries an
OWN/ done / status bit. WhenOWNis set the DMA engine owns the descriptor; when cleared, the driver does. Many DPDK PMDs surface the same idea through RX descriptor status, often backed by a hardwareDD(descriptor-done) bit.
The Solarflare / ef_vi vocabulary
Drop the right nouns and you sound like you already work there. An ef_vi virtual interface bundles an event queue plus a TX descriptor ring and RX descriptor ring, each with its own doorbell. The adapter signals completions via the event queue, not by mutating the descriptor in place. The receive ring depth is EF_RXQ_SIZE (valid values 512 / 1024 / 2048 / 4096; EF_TXQ_SIZE for TX) — a bigger RX ring absorbs larger bursts without drops, at the cost of a larger working set and the risk of hiding backpressure (worse tail latency).
And the latency flourish: TX_PUSH. When the TX descriptor ring is empty, ef_vi can ring the doorbell andwrite the packet buffer address “in one shot.” It's a sibling latency shortcut to CTPIO but a distinct mechanism: TX_PUSH accelerates the descriptor notification, while CTPIO pushes the packet data.
The lock-free SPSC ring buffer
This is the exercise to have cold: “implement a lock-free queue between a producer and a consumer” (an ISR and the main loop, or two threads).
Derive it live — don't just recite the code
narrate this chain in the roomThe interviewer is grading the reasoning, not the final listing. Build the data structure up one forced decision at a time, and say each “problem → therefore” out loud. This is the whole answer — the code at the end just falls out of it:
- I need to hand items from one thread to another. Simplest start: an array and a count.A bounded FIFO. Good enough to reason from.
- But dequeuing from the front of an array is O(n) — everything shifts down.Don't move the data, move indices. Keep a write cursor (
head) and a read cursor (tail). Both ops become O(1). - Those cursors run off the end of the array.Wrap them:
i % capacity. And if I make capacity a power of two, that'si & (capacity-1)— one AND instead of a divide. - When head == tail, is the buffer empty or full? Same cursor position for both — ambiguous.Let the indices run free (never mod the counters themselves, only when I index storage). Then empty is
head == tailand full ishead - tail == capacity. Unsigned wraparound is well-defined, so no wasted slot, no separate count. - Now two threads touch these cursors. Do I need a lock?If only the producer writes head and only the consumer writes tail, there is no contended write — each side reads the other's cursor but never writes it. No mutex needed. That's the SPSC insight (Lamport, 1977).
- Is plain shared memory enough? On a weakly-ordered CPU the store that bumps head could become visible before the payload write — the consumer would read garbage.Publish head with a release store; observe it with an acquire load. That pair creates the happens-before edge guaranteeing the payload is visible once the new head is. Mirror it for tail. This is the entire correctness argument.
- It's correct now. Is it fast? head and tail sit next to each other, so they share a cache line.Two cores writing the same line ping-pong it (false sharing). Put head and tail on separate cache lines with
alignas(64).
Every line below is just the residue of that reasoning — point back at the step each piece came from as you write it:
#include <stdatomic.h>
#include <stdalign.h>
#include <stdint.h>
#include <stdbool.h>
#define CACHELINE 64
typedef struct {
// Producer-owned cache line: ONLY the producer writes 'head'.
alignas(CACHELINE) _Atomic uint32_t head;
unsigned char pad1[CACHELINE - sizeof(_Atomic uint32_t)];
// Consumer-owned cache line: ONLY the consumer writes 'tail'.
alignas(CACHELINE) _Atomic uint32_t tail;
unsigned char pad2[CACHELINE - sizeof(_Atomic uint32_t)];
// Read-only after init.
uint32_t mask; // capacity - 1, capacity is 2^n
uint8_t *buf; // 'capacity' bytes
} spsc_t;
// Heap-allocate with aligned_alloc(CACHELINE, sizeof(spsc_t)) — plain
// malloc need not honour a 64-byte over-alignment in portable C.
// ---- Producer thread (e.g. the RX path / an ISR) -------------------
bool spsc_put(spsc_t *q, uint8_t v) {
uint32_t h = atomic_load_explicit(&q->head, memory_order_relaxed); // we own head
uint32_t t = atomic_load_explicit(&q->tail, memory_order_acquire); // observe consumer
if (h - t > q->mask) return false; // full when used == mask + 1 (capacity)
q->buf[h & q->mask] = v; // (1) write the payload
atomic_store_explicit(&q->head, h + 1, memory_order_release); // (2) PUBLISH
return true;
}
// ---- Consumer thread (e.g. the application / main loop) ------------
bool spsc_get(spsc_t *q, uint8_t *out) {
uint32_t t = atomic_load_explicit(&q->tail, memory_order_relaxed); // we own tail
uint32_t h = atomic_load_explicit(&q->head, memory_order_acquire); // observe producer
if (h == t) return false; // empty
*out = q->buf[t & q->mask]; // (1) read payload (safe!)
atomic_store_explicit(&q->tail, t + 1, memory_order_release); // (2) free the slot
return true;
}
Say these four things and you've nailed it
- Power-of-two capacity → mask, not modulo.
i & (cap-1)replacesi % cap. One AND instead of a divide — and free-runninguint32_tindices wrap correctly because unsigned overflow is well-defined, as long as capacity stays far below half the counter range sohead - tailcan't become ambiguous. - Full vs empty is unambiguous with free-running indices. Empty when
head == tail; full whenhead - tail == capacity. No wasted slot, no count field. - release / acquire is the correctness crux. The producer's
releasestore onheadpublishes the payload write that came before it; the consumer'sacquireload ofheadguarantees it sees that payload. Mirror fortail. Get this wrong and you read torn/stale data on weakly-ordered CPUs (ARM, POWER, RISC-V). - Cache-line padding kills false sharing. Pad
headandtailonto separate cache lines so the producer and consumer cores aren't ping-ponging the same line. If the queue is heap-allocated, usealigned_alloc(64, …)— plainmallocisn't guaranteed to honour a 64-byte over-alignment in portable C.
volatile to make this thread-safe, that's a red flag. volatile prevents the compiler from caching/eliding an access, but it is not a memory barrier, gives no atomicity, and gives no cross-CPU ordering. Thread safety needs _Atomic + acquire/release (or a barrier). volatile is for MMIO registers and ISR-visible variables on a single core — a different job.Name-drop the real implementations
It signals you've seen production code, not just a textbook: DPDK's rte_ring (multi-producer variant uses a CAS on the head), Linux's kfifo (power-of-two, barrier-based), and folly's ProducerConsumerQueue (the canonical C++ SPSC with readIndex/writeIndex and acquire/release). Lamport's 1983 paper is the original proof that SPSC needs no locks.
Bonus: bulk copy across the wrap
A likely follow-up: “now make it copy a whole buffer at once.” The answer is two memcpys — one to the end of storage, one for the wrapped remainder — never a per-byte branch. This is how real stacks move a packet into the ring.
#include <string.h>
// Bulk copy across the wrap point — two memcpys, never a branch per byte.
// Assumes rb->cap is a power of two (wraps with & cap-1); else use % cap.
size_t rb_write(ringbuf_t *rb, const void *src, size_t n) {
size_t free = rb->cap - rb->count;
if (n > free) n = free; // clamp to available space
size_t first = rb->cap - rb->head; // bytes until the end of storage
if (first > n) first = n;
memcpy(rb->buf + rb->head, src, first); // head .. end
memcpy(rb->buf, (const char *)src + first, n - first); // wrapped remainder
rb->head = (rb->head + n) & (rb->cap - 1);
rb->count += n;
return n;
}
Primary sources worth reading
If you read three things before the call, read the first three. These are the authoritative references behind everything above.
- Linux kernel: Circular Buffers — the canonical writeup on power-of-two buffers + the exact memory barriers when producer and consumer share no lock.
- AMD UG1586: Onload User Guide — ef_vi — the layer-2 API under Onload, straight from AMD's docs.
- DPDK Ring Library — production lockless FIFO: two producer/consumer head-tail pairs, masked 32-bit counters, bulk/burst ops.
- folly ProducerConsumerQueue.h — small, readable SPSC with cache-line separation and acquire/release.
- MIT 6.1810 E1000 networking lab — write a real NIC driver; cements TDH/TDT and the “TX is empty in steady state, RX holds free buffers” inversion.
- Lamport, “Concurrent Reading and Writing” (1977) — the origin proof that single-reader/single-writer needs no mutex.