๐ŸงฑDatapath Data Structures & Allocators

Probes whether a candidate can design hot-path data structures that respect cache, memory ordering, concurrency, and packet-rate failure modes.

Enqueue and dequeue a ring buffer; head and tail chase each other around.

012345670 / 8EMPTY
head (write โ†’ 0) tail (read โ†’ 0) occupied
Buffer initialized โ€” capacity 8, empty.
Compare an SPSC ring with an MPMC ring in a packet datapath. What changes in the algorithm and where do the hidden costs appear?senior

An SPSC ring can be extremely cheap because only one producer writes the producer index and only one consumer writes the consumer index. Each side can cache the other side's index and use release/acquire ordering around the descriptor store/load. There is no need for CAS on the hot index.

An MPMC ring needs coordination among producers and consumers. A common design reserves slots by atomically advancing a shared head, writes descriptors into the reserved range, then commits by publishing a tail in order. That reserve/commit split prevents consumers from seeing uninitialized slots, but it introduces CAS retries, cache-line bouncing on shared heads/tails, and a sharp failure mode: a producer that reserves slots and is then preempted blocks every later producer from advancing the visible tail, because the tail must be published in reservation order. This is the Lock-Waiter-Preemption (LWP) problem.

DPDK rte_ring exposes this distinction through single/multi producer and consumer modes; its default MP/MC mode updates head and tail as one 64-bit CAS, and it added HTS (head/tail sync) and RTS (relaxed tail sync) modes specifically to bound the damage a stalled producer does under overcommit. The single-producer path is not just a minor optimization; it removes a contended atomic protocol. In a low-latency datapath, choosing SPSC per queue or per core is often better than building a heroic shared MPMC queue.

MPMC: producers CAS a shared head; losers retry.
MPMC: producers CAS a shared head; losers retry.
What they're listening for: The key is the two-phase reserve/commit nuance and the Lock-Waiter-Preemption failure mode (a stalled producer blocks tail publication), not just knowing that MPMC uses atomics.
Follow-ups
  • Why can an MPMC ring be lock-free but still have terrible tail latency?
  • What cache lines would you pad?
  • When is a per-core SPSC mesh better than one shared MPMC queue?
Show a minimal SPSC ring enqueue/dequeue pair and explain the required memory ordering on a weakly ordered CPU.senior

The producer must publish the element before it publishes the new head. The consumer must observe the head before it consumes the element.

#define N 1024u
struct ring {
  void *slot[N];
  _Atomic unsigned head;
  _Atomic unsigned tail;
};

int push(struct ring *r, void *p) {
  unsigned h = atomic_load_explicit(&r->head, memory_order_relaxed);
  unsigned t = atomic_load_explicit(&r->tail, memory_order_acquire);
  if (((h + 1) & (N - 1)) == t) return -1;
  r->slot[h] = p;
  atomic_store_explicit(&r->head, (h + 1) & (N - 1), memory_order_release);
  return 0;
}

void *pop(struct ring *r) {
  unsigned t = atomic_load_explicit(&r->tail, memory_order_relaxed);
  unsigned h = atomic_load_explicit(&r->head, memory_order_acquire);
  if (t == h) return 0;
  void *p = r->slot[t];
  atomic_store_explicit(&r->tail, (t + 1) & (N - 1), memory_order_release);
  return p;
}

The release store of head and the acquire load of head form a release/acquire pair: it guarantees the consumer that observes the new head also observes the prior r->slot[h] write. The producer reads its own head relaxed (it is the only writer) but reads tail acquire to see freed slots. On x86 this may appear to work with plain loads/stores because the hardware is TSO and only reorders store-then-load, but portable C and weakly ordered CPUs (Arm/Power) genuinely need the acquire/release or the slot write can be observed after the index. The two indices should sit on separate cache lines to avoid producer/consumer false sharing; N being a power of two lets the wrap be a mask; and the one-empty-slot convention distinguishes full from empty without a separate count.

What they're listening for: This exposes whether the candidate can name the release/acquire pairing that ties the slot write to the index publication, not just cargo-cult volatile, and knows why x86 hides the bug.
Follow-ups
  • Why is `volatile` not a synchronization primitive?
  • Why exactly does x86 TSO hide a missing barrier here?
  • How would you batch index publication to cut cache traffic?
A junior engineer writes a lock-free MPSC stack with CAS on the head pointer and it corrupts under load. Walk through the ABA problem and how you would actually fix it.staff

The bug is almost always ABA. A popping thread reads head = A and computes A->next = B, intending CAS(&head, A, B). It is preempted. Meanwhile other threads pop A, pop B, push A back (often after the node was freed and reused). Now head == A again, so the stalled CAS(&head, A, B) succeeds even though B may be freed or no longer the correct successor. The pointer compared equal, but the world changed and changed back. The result is a freed node back on the stack, a lost node, or a cycle.

The failure is not the CAS instruction; it is that a raw pointer carries no notion of "has this been recycled". Fixes fall into two families.

Tag the pointer. Pack a monotonically incremented version counter alongside the pointer and CAS the whole thing, so a recycled pointer fails the compare because the tag advanced. With a 64-bit pointer you need a double-width CAS (cmpxchg16b / Arm CASP) or pointer-packing tricks in unused high bits:

typedef struct { void *ptr; uintptr_t tag; } tagged;   /* 16 bytes */
_Atomic tagged head;

void push(_Atomic tagged *h, struct node *n) {
  tagged old = atomic_load(h), neu;
  do {
    n->next = old.ptr;
    neu.ptr = n;
    neu.tag = old.tag + 1;          /* version defeats ABA */
  } while (!atomic_compare_exchange_weak(h, &old, neu));
}

Fix reclamation instead. ABA is really a use-after-free of recycled memory, so defer reuse until no thread can hold the old pointer: hazard pointers (each reader publishes the pointer it is about to dereference; a node is freed only when no hazard pointer references it), epoch/QSBR reclamation, or RCU. These also solve the related problem of a popper dereferencing A->next after A was freed, which a tag alone does not.

In a datapath I usually prefer to design the ABA away: per-core SPSC queues with no shared head, or index-based ring slots out of a fixed array (indices in a bounded ring don't suffer ABA the same way because slots aren't malloc'd/freed underneath you).

What they're listening for: A staff answer explains ABA as recycled-pointer aliasing, distinguishes tagged-pointer (needs double-width CAS) from reclamation schemes (hazard pointers/RCU also fix the dereference-after-free), and notes that designing with per-core/index rings sidesteps it.
Follow-ups
  • Why does a version tag need a double-width CAS on 64-bit?
  • What does a tag NOT protect against that hazard pointers do?
  • Why are bounded index rings less exposed to ABA?
You need a read-mostly flow table updated by a control plane while datapath cores do lockless lookups. How would you use RCU, and what are the common correctness bugs?senior

Use RCU when readers must run without taking a contended lock and updates are less frequent. In the kernel, readers enter an RCU read-side critical section, fetch protected pointers with rcu_dereference(), and writers publish replacement pointers with rcu_assign_pointer() or list helpers such as hlist_add_head_rcu(). The publish primitive carries a release barrier and the dereference a dependent-load (consume) barrier, so a reader that sees the new pointer also sees the fully initialized object. Deletion unlinks the object first, then frees it after a grace period with call_rcu() or synchronize_rcu(), where a grace period is the point at which every CPU has passed through a quiescent state and therefore no pre-existing reader can still hold the old pointer.

The common bugs are freeing immediately after unlink (skipping the grace period), modifying fields in place that readers assume are immutable, forgetting that readers can keep using the old object for a while, and publishing a pointer before the object is fully initialized (defeats the publish/subscribe barrier). Another bug is using RCU to protect lifetime but not multi-field consistency; if readers need a coherent snapshot, publish a new object (or a versioned struct) rather than updating fields one at a time.

For user-space datapaths the same pattern applies with QSBR or epoch reclamation: readers announce quiescent states, writers defer reclamation. The operational risk is a stuck reader: one core that never reports quiescence (spinning in a long loop, or a thread that forgot to mark quiescent state) stalls every grace period, so call_rcu callbacks pile up and the deferred-free backlog grows until memory pressure becomes the outage.

What they're listening for: A strong answer distinguishes pointer lifetime from object consistency, explains grace period as 'all CPUs passed a quiescent state', and calls out the stuck-reader grace-period backlog as a real production failure.
Follow-ups
  • What fields must be immutable under RCU?
  • How do you detect a stuck reader?
  • When is a seqlock better than RCU?
Cuckoo hashing is attractive for flow tables. Explain the lookup path, insertion failure modes, and concurrency implications.senior

Cuckoo hashing gives each key a small number of candidate buckets, typically two hash functions, and in practical designs several slots per bucket. Lookup checks only those candidate buckets, which is cache-friendly and worst-case bounded (a constant number of cache lines) versus chasing an open chain. DPDK's hash library uses exactly this bucketized cuckoo design for fixed-size keys.

Insertion is the hard part. If all candidate slots are full, the table evicts an existing key and re-homes it to its alternate bucket, which may cascade along an eviction path (chosen by random walk or BFS). Two numbers matter: basic cuckoo with two single-slot tables wedges around a 50% load factor, but bucketized cuckoo with ~4-8 slots per bucket pushes a stable load factor up toward ~95%, which is why real implementations bucketize. Near that ceiling, insertion latency becomes highly variable, the eviction walk can hit a bounded maximum displacement and fail, and you fall back to a small stash or trigger a resize/rehash. That is a serious datapath property: lookups stay O(1) and predictable while inserts develop a long, sometimes unbounded tail.

Concurrency depends on the implementation. Lockless readers are easy if slots are published atomically, but a writer that moves a key creates a window where a concurrent reader could miss it (see it in neither the source nor the destination bucket) unless you order the moves carefully and use versioning. The known-good trick (from MemC3 / libcuckoo) is per-bucket version counters: readers retry if a bucket's version changed during the read, and the displacement is done so the key is always findable. A candidate who says "cuckoo is O(1)" and stops is missing the insertion tail and the move-vs-read race.

What they're listening for: The nuance is the 50% (basic) vs ~95% (bucketized) load-factor ceiling, the bounded-displacement insertion failure with stash/resize, and the move-during-read race fixed by per-bucket version counters.
Follow-ups
  • Why does bucketizing raise the load-factor ceiling so much?
  • How do per-bucket version counters make reads safe during a move?
  • How do stored signatures reduce full key comparisons?
Why is `malloc()` on the packet hot path usually unacceptable, and how would you design a mempool for RX/TX buffers?senior

malloc() brings unpredictable latency, locks or shared metadata contention, cache pollution, TLB churn, fragmentation, and failure behavior that is hard to reason about at line rate. A packet path needs bounded allocation and free, preferably with the object already DMA-safe and cache-aligned.

A typical design has a fixed-size pool of packet buffers backed by a ring, with per-core caches (magazines) that satisfy most alloc/free without touching the shared ring, refilling/draining in bulk only when the local cache underruns or overflows. Bulk amortizes the one unavoidable atomic over many objects. The buffer layout reserves headroom for L2/L3 header push/pop, aligns frequently touched metadata to cache lines, and keeps DMA data and CPU metadata from false-sharing. For NIC RX, descriptors point at buffers recycled after the packet is consumed; for TX, a buffer cannot return to the pool until the completion path proves the NIC finished DMA-reading it, or you corrupt in-flight packets.

Two more decisions a senior engineer names. Huge pages: backing the pool with 2 MB/1 GB pages keeps each buffer's DMA mapping inside one IOMMU/TLB entry, which matters because at 148 Mpps a TLB miss per packet is fatal; it also keeps the pool physically contiguous. Overload policy: under pressure, do you drop early, backpressure the producer, steal from another core's cache, or pull from an emergency reserve? The allocator's overload behavior *is* part of the packet-loss behavior, so it should be a deliberate policy, not whatever malloc happens to do.

What they're listening for: This probes whether the candidate sees allocation as latency + DMA lifetime + cache layout + per-core magazines + huge pages + an explicit overload/drop policy, not merely a faster malloc.
Follow-ups
  • How large should per-core caches be, and what causes cache imbalance?
  • Why do huge pages matter for the DMA mapping specifically?
  • Why can a TX buffer not be freed at post time?
Intrusive linked lists are common in kernels and drivers. Why use them, and what are the hazards in a high-performance datapath?senior

An intrusive list stores the linkage inside the object, such as Linux struct list_head or struct hlist_node. That avoids separate allocation for list nodes, improves locality, and lets one object participate directly in queues such as free lists, timers, or hash chains. hlist uses a single-pointer head (instead of a full list_head) specifically so a large hash-bucket array is half the size, which matters when you have millions of buckets.

The hazards are ownership and lifetime. Removing an object from one list while another subsystem still expects it linked causes corruption. Reusing the same list_head node for two lists simultaneously is a classic bug (it can only be on one at a time). In concurrent code, readers can follow stale next pointers unless protected by a lock, RCU (list_*_rcu variants), reference count, or a strict single-owner rule. Poisoning pointers in debug builds (LIST_POISON) catches some use-after-remove bugs but is not a synchronization scheme.

In the datapath, a linked list is also a performance smell. Pointer chasing serializes dependent loads and defeats prefetch, so a linear scan wastes cache and memory bandwidth. Intrusive lists are good for cold control paths, free lists, timer buckets, and collision chains with bounded length; they are rarely ideal for per-packet linear scans, where an array or a hash with bucketed slots prefetches far better.

What they're listening for: The answer should balance why intrusive lists are practical in C (no node alloc, hlist halves bucket arrays) with the concurrency (RCU/single-owner) and pointer-chasing cache costs.
Follow-ups
  • How does `hlist` help large hash tables?
  • What invariant would you assert in debug builds?
  • When would you replace a list with an array?
Explain DIR-24-8 style IPv4 longest-prefix-match and why it is attractive for software forwarding. What are the memory and update tradeoffs?senior

DIR-24-8 splits IPv4 lookup into a direct table tbl24 indexed by the top 24 bits and second-level tbl8 tables for longer prefixes. Prefixes of length 24 or less are resolved with one array access. For routes longer than /24, the matching tbl24 entry is flagged to point at a 256-entry tbl8 indexed by the low 8 bits, costing a second access. DPDK's LPM library implements exactly this for 32-bit keys and exposes a bulk lookup API.

The attraction is predictable lookup: one or at most two cacheable memory references, no tree walk, and no key-length-dependent loop. That is ideal at packet rate, especially with batching and software prefetch over several destination addresses at once.

The cost is memory and update amplification. tbl24 alone is 2^24 entries (a 16M-entry array), and because a prefix shorter than /24 must be *expanded* across every tbl24 slot it covers, inserting or deleting a single short prefix can touch a large contiguous range of entries; a /8, for instance, rewrites 65536 tbl24 entries. Deleting a /25 means restoring the covered range to whatever the next-longest matching prefix was, which requires knowing that prefix, so production stacks keep an auxiliary structure (the control-plane RIB) to recompute it. IPv6's 128-bit space cannot use a 24-bit direct index at useful depth, so it needs a multi-level / tree or different scheme. For dynamic routing you separate the control-plane RIB from the dataplane FIB and publish updates with RCU-like lifetime so lookups never see a half-written table.

What they're listening for: This tests LPM as a cache/memory tradeoff: one or two accesses via tbl24/tbl8, the 16M-entry table, and prefix-expansion update cost (a /8 rewrites 64K entries; a /25 delete must restore the covering prefix).
Follow-ups
  • Why are bulk lookup APIs and prefetch useful here?
  • How would IPv6 change the design?
  • Why does deleting a /25 require knowing the covering prefix?
Where would you use a Bloom filter in a NIC or software datapath, and what are the failure modes?senior

A Bloom filter is a fast negative cache: before an expensive flow-table, ACL, connection-tracking, or slow-path lookup, test whether the key is definitely absent. Because a miss can be answered from a small bit array that stays in L1/L2, it cuts cache misses and pointer chasing when most packets miss.

The filter has false positives but no false negatives if maintained correctly. A false positive just sends a packet to the expensive path unnecessarily; a false negative would be a correctness bug (you would wrongly skip the real lookup). The false-positive rate is roughly (1 - e^(-kn/m))^k for m bits, k hash functions, n items, minimized at k = (m/n) ln 2; as n approaches the design m, the filter saturates and nearly everything reads as present, so it stops filtering and you pay hashing cost for no benefit.

Deletes are the sharp edge: a plain Bloom filter cannot remove an item (clearing bits could erase other keys), so use a counting Bloom filter (per-slot counters, which then risk counter overflow), periodic epochal rebuild, or generation tagging. In a concurrent datapath, swapping in a rebuilt filter has the same lifetime issues as any read-mostly structure: readers must not see a partially initialized bitset, and the old filter must not be freed while readers still touch it (RCU/epoch). A cheaper modern alternative when you need deletes is a cuckoo filter, which stores fingerprints and supports deletion directly.

What they're listening for: The trap is treating Bloom filters as exact membership or forgetting saturation and delete semantics; bonus for the k = (m/n)ln2 sizing and counting/cuckoo filter for deletes.
Follow-ups
  • How would you size m and k for a target false-positive rate?
  • Why can a counting Bloom filter still fail (overflow)?
  • When is a tiny exact cache better than a Bloom filter?
Design a timer facility for millions of connection timeouts in a userspace datapath. Compare a binary heap with a hierarchical timing wheel.staff

A binary heap gives precise ordering and O(log n) insert/delete, but every packet that refreshes a connection timeout becomes a heap update, and with millions of flows and frequent refreshes the log n factor plus cache misses (the heap is not locality-friendly) dominate.

A timing wheel (Varghese & Lauck) buckets timers by expiry tick into a circular array; insert and per-tick expiry are O(1) amortized. A single wheel only covers slots * tick of time, so a hierarchical wheel stacks coarser wheels (like clock hands): when the low wheel wraps, the next entry of the higher wheel is cascaded down and its timers redistributed into the fine wheel. That bounds memory while covering long timeouts, and it fits connection timeouts because exact precision is unnecessary; expiring within a tick is fine. Using power-of-two slot counts turns the per-level index into a mask.

The hard parts are cancellation and refresh, which dominate real traffic. You do not want to pay a list removal on every keepalive. The standard trick is lazy/deferred expiry: store an absolute deadline (or an expiry generation) in the connection object and, when a bucket fires, revalidate, if the stored deadline is now in the future, the flow was refreshed, so silently re-insert it instead of timing it out. That means a refresh is just a field write; no wheel surgery. Per-core wheels avoid cross-core locking (each core owns its flows; migrated flows need explicit handoff). Finally, the tick must come from a monotonic clock, and after a scheduling pause the wheel may owe many ticks at once, so catch-up must be bounded per iteration (process a budget of buckets, yield, resume) or one stall turns into an unbounded expiry storm that drops live traffic.

What they're listening for: A staff answer explains cascading between wheel levels, lazy/deadline-revalidation so refresh is a field write (not a wheel op), per-core ownership, and bounded catch-up after a pause; not just big-O.
Follow-ups
  • How exactly does cascading move timers between levels?
  • Why does deadline-revalidation make refresh cheap?
  • What happens after a 200 ms scheduling pause, and how do you bound it?
Given a packet metadata struct used by every stage of the datapath, how would you make its layout cache-conscious without making it unmaintainable?staff

First profile field access by stage. Put fields touched on every packet in the first cache line: data pointer, length, offload flags, RSS hash, queue id, and a compact flow/classification result if needed (this mirrors how rte_mbuf deliberately groups hot fields into its first 64-byte line). Move cold fields, timestamps, debug/trace, rarely used tunnel metadata, into a later line or out of line entirely. The goal is that the common per-packet path touches one cache line, so a burst of packets costs one miss each rather than two or three.

Avoid false sharing by ensuring per-packet metadata is owned by one core at a time, or by separating producer-owned and consumer-owned fields onto different lines so two cores updating the same buffer do not ping-pong a shared line. For arrays of descriptors, use fixed-size aligned entries so hardware/software prefetch strides cleanly. For optional metadata, use an index or pointer into side storage rather than bloating the common case.

The maintainability trick is to encode intent and enforce it mechanically: comment the cache-line boundaries, group fields by access phase, and add static_assert(sizeof(struct meta) == 128, ...) and static_assert(offsetof(...) < 64, ...) so an accidental field addition that pushes a hot field past the line, or grows the struct, fails the build instead of silently regressing performance. Avoid over-packing into bitfields when it forces read-modify-write on a hot word or generates branchy code. The best layout is one whose performance properties survive the next feature addition because the build guards them.

What they're listening for: This probes mature cache reasoning: hot/cold split into the first line (rte_mbuf-style), producer/consumer ownership to kill false sharing, prefetch-friendly arrays, and static_asserts on size/offset as drift guards.
Follow-ups
  • How would you measure field hotness in production?
  • When do bitfields hurt more than they help?
  • What static assertions would you add and why?