Kernel Bypass & Userspace Networking
Probes whether a senior engineer can reason from NIC rings, memory ordering, batching, flow steering, and real bypass APIs to production failure modes.
Post a work request, the NIC executes it zero-copy: no kernel, no copy.
A trading application sees lower median latency after moving from Linux sockets to DPDK, but worse p99.99 during market data bursts. Walk through the cost model of kernel bypass and the ways it can still lose at the tail. [senior]
Bypass removes or amortizes costs that dominate small packets: syscalls, readiness notification (epoll), scheduler handoff, skb allocation/free, kernel/userspace copies, protocol-stack branching, and cache pollution from touching generic socket state. DPDK also keeps RX/TX queues, mbufs, mempools, and the poll loop hot on a chosen lcore, so the path is branch-predictable and cache-resident.
The tail can get worse because bypass makes the application responsible for work conservation and backpressure:
- A poll loop can starve housekeeping, timers, logging, or another queue if lcore placement is wrong.
- Burst APIs improve throughput but add queueing delay when the loop waits to accumulate work or drains too much from one source before checking others.
- Hugepage-backed mbuf pools can still run empty under incast; failed RX descriptor replenishment becomes NIC `rx_nombuf` drops.
- RSS imbalance or a hot flow can overflow one hardware queue while other cores sit idle.
- PCIe DMA, DDIO/LLC allocation, IOTLB pressure, NUMA misses, and false sharing can dominate once syscalls are gone.
- The kernel no longer absorbs bursts with qdisc/socket buffers; ring sizes are now explicit latency-versus-loss tradeoffs.
A strong diagnosis separates service time from waiting time: measure NIC drops, rx_nombuf, ring occupancy, per-queue packet rates, burst sizes, loop iteration time, LLC misses, remote NUMA accesses, and whether the p99.99 coincides with RX refill, TX completion cleanup, mempool depletion, or a flow-steering hotspot. A median win with a tail regression almost always means a queueing or placement problem the kernel used to hide, not a slower datapath.
- Which counters would you read first in DPDK and on the NIC?
- When would you intentionally reduce burst size?
- How would you prove this is not packet pacing or exchange-side burstiness?
Explain the memory-ordering requirements for a userspace RX ring shared with a NIC or kernel producer. What breaks on weakly ordered CPUs if the barriers are wrong? [staff]
There are two distinct orderings to preserve: descriptor/data visibility and index/doorbell visibility. The producer must publish packet data and descriptor fields before it advances the producer index or writes an event. The consumer must read the index/event before reading the descriptor, and must not recycle a buffer before it has finished consuming the packet.
On x86, TSO (stores are not reordered with stores, loads not with loads) hides many bugs, but device DMA and MMIO are not ordinary cached stores. On Arm or other weakly ordered systems, a missing acquire can let the app observe a new producer index and then read stale descriptor contents. A missing release before refilling a fill ring can let the NIC see an available descriptor before the address/length is valid. A missing MMIO write barrier before a TX doorbell can ring the NIC before descriptors are coherent in memory.
A minimal SPSC-style consumer pattern is:
    uint32_t prod = atomic_load_explicit(&ring->prod, memory_order_acquire);
    if (cons != prod) {
        struct desc *d = &rx[cons & mask];
        handle_packet(d->addr, d->len); /* ordered after the acquire load */
        atomic_store_explicit(&ring->cons, cons + 1, memory_order_release);
    }
Real NIC code also needs the driver/API-specific DMA sync rules, cacheline alignment, and MMIO ordering primitives. For DPDK, use the PMD and the rte_eth_rx_burst() contract rather than inventing descriptor access; the PMD inserts rte_io_wmb() / rte_io_rmb() and the doorbell write in the right places. For AF_XDP, follow the libxdp/libbpf ring helpers, which carry the __atomic acquire/release on the shared producer/consumer indices.
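To make both halves of the ordering contract concrete, here is a minimal software-only sketch of a single-producer/single-consumer ring in C11 atomics. It is a model of the pattern and not any driver's descriptor layout: `struct spsc_ring`, `ring_publish()`, and `ring_consume()` are hypothetical names, and real device rings additionally need the DMA and MMIO barriers described above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SLOTS 8u                    /* hypothetical size, power of two */
#define RING_MASK  (RING_SLOTS - 1u)
static_assert((RING_SLOTS & RING_MASK) == 0, "ring size must be a power of two");

struct desc { uint64_t addr; uint32_t len; };

struct spsc_ring {
    _Alignas(64) _Atomic uint32_t prod;  /* written only by the producer */
    _Alignas(64) _Atomic uint32_t cons;  /* written only by the consumer */
    _Alignas(64) struct desc slots[RING_SLOTS];
};

/* Producer: write the descriptor first, then publish it with a release store. */
static bool ring_publish(struct spsc_ring *r, uint64_t addr, uint32_t len)
{
    uint32_t prod = atomic_load_explicit(&r->prod, memory_order_relaxed);
    uint32_t cons = atomic_load_explicit(&r->cons, memory_order_acquire);
    if (prod - cons == RING_SLOTS)
        return false;                    /* full: the slot is not yet recycled */
    r->slots[prod & RING_MASK] = (struct desc){ .addr = addr, .len = len };
    atomic_store_explicit(&r->prod, prod + 1, memory_order_release);
    return true;
}

/* Consumer: acquire-load the index, then read the descriptor it covers. */
static bool ring_consume(struct spsc_ring *r, struct desc *out)
{
    uint32_t cons = atomic_load_explicit(&r->cons, memory_order_relaxed);
    uint32_t prod = atomic_load_explicit(&r->prod, memory_order_acquire);
    if (cons == prod)
        return false;                    /* empty */
    *out = r->slots[cons & RING_MASK];   /* copy out before releasing the slot */
    atomic_store_explicit(&r->cons, cons + 1, memory_order_release);
    return true;
}
```

The release store on `cons` is what lets the producer safely overwrite the slot: the consumer's copy of the descriptor is ordered before the index update that frees it. Placing `prod` and `cons` on separate cache lines is one answer to the padding follow-up below.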
- Why can a bug reproduce on Arm but not x86?
- Where would you put cacheline padding?
- What changes for an MPSC software ring such as `rte_ring`?
In DPDK, what are the important semantics and traps around `rte_eth_rx_burst()`, `rte_eth_tx_burst()`, `rte_mbuf`, and mempools? [senior]
rte_eth_rx_burst() returns up to the requested number of packets from one port/queue into an array of struct rte_mbuf *. Returning fewer than the burst size is not an end-of-stream signal; it usually just means the queue had fewer packets at that instant. rte_eth_tx_burst() similarly returns the count actually accepted into the TX ring, which may be fewer than requested; the application owns the unsent mbufs (typically rte_pktmbuf_free() them or retry).
The mbuf is both metadata and the packet buffer handle. The default buffer size is RTE_MBUF_DEFAULT_BUF_SIZE, which is RTE_MBUF_DEFAULT_DATAROOM (2048 bytes) plus RTE_PKTMBUF_HEADROOM (128 bytes by default). Senior-level traps:
- Leaking mbufs on partial TX or error paths, which silently drains the pool until `rx_nombuf` drops appear.
- Touching packet data before prefetching metadata/data at high packet rates, so every access stalls on a cache miss.
- Forgetting that scattered/jumbo packets span multiple mbuf segments chained via `next`, requiring segment traversal and multi-seg offload flags.
- Allocating one mempool on the wrong NUMA socket, causing remote memory and IOTLB pressure on every packet.
- Sizing mempools for average rate, not for bursts plus RX ring fill plus TX in-flight plus application queues plus the per-lcore mempool cache.
- Assuming offload flags (`RTE_MBUF_F_TX_*`) are portable across PMDs without checking `rte_eth_dev_info` capabilities.
The PMD owns hardware-specific descriptor handling. The application owns queue affinity, burst policy, mbuf lifetime, and backpressure. Most production packet loss in DPDK is not the rx_burst call itself; it is failure to keep RX descriptors replenished, or a hot queue that the lcore cannot drain fast enough.
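The pool-sizing trap above is easiest to see as arithmetic. The sketch below adds up every place an mbuf can sit at once; `struct pool_budget` and `pool_min_mbufs()` are hypothetical helpers, not DPDK APIs, and the result should be treated as a lower bound because per-lcore caches can briefly hold more than their nominal size.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical budget: every field is a place where mbufs can be parked. */
struct pool_budget {
    uint32_t rx_queues, rx_desc;   /* buffers posted to each RX ring */
    uint32_t tx_queues, tx_desc;   /* worst case: every TX slot still in flight */
    uint32_t lcores, cache_size;   /* per-lcore mempool cache can strand mbufs */
    uint32_t app_queue_depth;      /* software rings between pipeline stages */
    uint32_t burst_headroom;       /* mbufs held in local burst arrays */
};

/* Minimum mbuf count so that RX refill never competes with the other holders. */
static uint32_t pool_min_mbufs(const struct pool_budget *b)
{
    uint64_t n = (uint64_t)b->rx_queues * b->rx_desc
               + (uint64_t)b->tx_queues * b->tx_desc
               + (uint64_t)b->lcores * b->cache_size
               + b->app_queue_depth
               + b->burst_headroom;
    assert(n <= UINT32_MAX);
    return (uint32_t)n;
}
```

For example, 4 RX and 4 TX queues of 1024 descriptors, 8 lcores with a 256-entry cache, a 4096-deep software ring, and 256 mbufs of burst headroom come to 14592 mbufs before any safety margin. Sizing from the average packet rate alone would land far below that.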
- What does `rx_nombuf` tell you?
- How do you size an mbuf pool for N RX queues?
- When is software prefetch harmful?
How do RSS, flow steering, and queue affinity interact in a kernel-bypass design, and what are the failure modes in an AI-cluster or market-data workload? [senior]
RSS hashes selected header fields (usually a Toeplitz hash over the 4-tuple, keyed by an RSS key) into a receive queue via an indirection table. Flow steering adds explicit rules: match a tuple, VLAN, tunnel field, MAC, or application filter and direct it to a queue or VI (on Solarflare/AMD adapters these are hardware filters that steer to an ef_vi). Queue affinity then binds that queue to a polling core and memory locality domain.
Failure modes are workload-shaped:
- A single elephant flow pins to one queue and saturates one core while aggregate NIC utilization looks low, because RSS hashes a flow to exactly one queue.
- Symmetric/bidirectional traffic may not land on the same queue in both directions unless the RSS key and tuple selection are designed to be symmetric.
- Encapsulation hides entropy if the NIC hashes only outer headers or lacks the tunnel parser, so all VXLAN/Geneve traffic collapses onto few queues.
- Multicast market data may all arrive on one queue unless hardware filters or replication are configured deliberately.
- AI collectives create incast where many senders hash to a small set of receiver queues during all-reduce phases.
- Reprogramming filters during live traffic can cause packet reordering, drops, or transient delivery to the kernel path.
A staff engineer asks what invariant the queue mapping must preserve: cache locality, per-flow ordering, load balance, or isolation. You rarely get all four at once, and steering changes mid-flight trade ordering for balance.
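The symmetric-hashing point can be checked in software. Below is a small sketch of the Toeplitz hash over an IPv4/TCP 4-tuple, plus a key that repeats a 16-bit pattern so that swapping source and destination yields the same hash. The helper names and the `0x6d5a` pattern are illustrative assumptions; a real NIC also applies an indirection table after the hash, which this sketch leaves out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RSS_KEY_LEN 40

/* Toeplitz hash: for every set input bit, XOR in the 32-bit key window
 * that starts at that bit position. */
static uint32_t toeplitz_hash(const uint8_t key[RSS_KEY_LEN],
                              const uint8_t *data, size_t len)
{
    assert(len + 4 <= RSS_KEY_LEN);
    uint32_t window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
                      ((uint32_t)key[2] << 8)  |  (uint32_t)key[3];
    uint32_t hash = 0;
    for (size_t i = 0; i < len; i++) {
        for (int bit = 7; bit >= 0; bit--) {
            if (data[i] & (1u << bit))
                hash ^= window;
            /* Slide the window one bit, pulling in the next key bit. */
            window = (window << 1) | ((uint32_t)(key[i + 4] >> bit) & 1u);
        }
    }
    return hash;
}

/* Hash input: src IP, dst IP, src port, dst port, in network byte order. */
static void tuple_bytes(uint8_t out[12], uint32_t sip, uint32_t dip,
                        uint16_t sport, uint16_t dport)
{
    for (int i = 0; i < 4; i++) {
        out[i]     = (uint8_t)(sip >> (24 - 8 * i));
        out[4 + i] = (uint8_t)(dip >> (24 - 8 * i));
    }
    out[8]  = (uint8_t)(sport >> 8);  out[9]  = (uint8_t)sport;
    out[10] = (uint8_t)(dport >> 8);  out[11] = (uint8_t)dport;
}

/* A key with a 16-bit period: the key window for bit p equals the window
 * for bit p+16 and p+32, so swapping addresses or ports cannot change the hash. */
static void symmetric_key(uint8_t key[RSS_KEY_LEN])
{
    for (int i = 0; i < RSS_KEY_LEN; i += 2) {
        key[i]     = 0x6d;
        key[i + 1] = 0x5a;
    }
}
```

The cost of symmetry is entropy: a periodic key makes many distinct tuples collide, so it trades some load balance for keeping both directions of a flow on one queue. That is the same invariant tension described above.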
- How would you validate the NIC is hashing inner tunnel headers?
- What is the cost of preserving per-flow ordering?
- How would you steer multicast without duplicating too much work?
Compare DPDK, AF_XDP zero-copy, ef_vi, and Onload for a low-latency service. When would you choose each? Be precise about what layer each one operates at. [staff]
These are not interchangeable; they sit at different layers, which is the whole point.
ef_vi is the Solarflare/AMD layer-2 API. You get direct access to a virtual interface with RX/TX descriptor rings and an event queue, and you send/receive raw Ethernet frames. There is no TCP, no UDP, no sockets; the application (or a library on top) owns all protocol logic. It targets the lowest possible latency and exposes NIC-specific accelerations like CTPIO. Choose it when hardware and application are co-designed and you want absolute control of the wire.
Onload is a userspace TCP/UDP stack with a sockets shim. It is LD_PRELOAD-ed so ordinary socket()/send()/recv() calls are intercepted and serviced by a userspace stack over ef_vi underneath, bypassing the kernel for eligible flows. The application keeps its POSIX sockets code unchanged. The tradeoff: not every socket option, traffic pattern, or kernel interaction is accelerated, and non-accelerated operations fall back to the kernel stack. So ef_vi gives you raw frames and you write the protocol; Onload gives you a real TCP stack behind the BSD socket API. They are layered, not competitors.
DPDK is a portable userspace packet-I/O framework: poll-mode drivers (PMDs), mbufs, mempools, rings, hugepages, and a large ecosystem. Like ef_vi it is layer-2/raw; you supply or import the L3/L4 stack. It is strong for custom datapaths, appliances, telemetry, and multi-vendor deployments, but the application owns the network stack and operations model.
AF_XDP is a Linux-native packet path built on XDP and UMEM rings. With XDP_ZEROCOPY, bind() fails if zero-copy is unavailable; with neither XDP_ZEROCOPY nor XDP_COPY set, the kernel tries zero-copy first and falls back to copy mode without reporting an error. It is attractive when you want XDP integration, kernel coexistence, and less framework weight than DPDK, but driver support, queue binding, UMEM lifecycle, and need_wakeup semantics matter.
The decision is about required semantics, not headline latency: raw packets versus sockets, custom protocol versus TCP correctness, operational portability, NIC feature dependence, observability, and what failure mode you tolerate under overload.
- What breaks if an Onload flow falls back to the kernel?
- How would you detect AF_XDP silently running in copy mode?
- Where does TCP retransmission logic live in each design?
In Solarflare/AMD ef_vi terms, what is a VI, what are event queues for, and why do CTPIO/TX_PUSH-style paths matter? [senior]
A virtual interface is the application-visible NIC endpoint. In ef_vi it bundles an RX descriptor ring, a TX descriptor ring, an event queue, doorbells, and related state. The application posts receive buffers, posts transmits, and polls the event queue for completions and receive events. Multiple VIs can be created and steered to independently via hardware filters.
The event queue is the unified completion/notification path. It reports which RX descriptors produced packets, which TX descriptors completed, and exceptional events. It is also where batching tension shows up: polling fewer events per call reduces overhead, but delayed completion processing can hold TX buffers too long or delay RX reposting, starving the RX ring.
CTPIO (cut-through PIO) and TX_PUSH are latency optimizations for small sends. Instead of writing descriptors to memory and ringing a doorbell so the NIC DMAs the payload later, the CPU streams (pushes) packet bytes across PCIe directly into the adapter's transmit path. CTPIO can even begin putting the frame on the wire while it is still streaming across PCIe; a ct_threshold controls how many bytes are buffered before cut-through starts. This saves a PCIe read round trip and cuts small-packet latency on supported adapters (TX push threshold, for example, is honored only on certain NIC generations).
The gotchas are hardware-specific size/alignment limits and fallback behavior. If a frame exceeds the CTPIO limit or the path is unavailable, the send falls back to normal DMA, which creates latency bimodality. You need counters that distinguish CTPIO sends, fallback sends, TX-queue-full, and completion latency, or the p99 will mysteriously double under conditions you cannot see.
- What counters would show CTPIO fallback?
- Why can increasing `EF_RXQ_SIZE` reduce drops but hurt latency?
- How do you avoid running out of posted RX buffers?
Sketch an ef_vi poll loop: how do you process RX and TX completions correctly from one event queue? [staff]
ef_vi delivers everything (RX arrivals, TX completions, errors) through a single event queue you drain with ef_eventq_poll(). The loop pulls a batch of events, switches on EF_EVENT_TYPE(), and crucially must repost RX buffers and reclaim TX buffers so neither ring starves.
    ef_event evs[64];
    int n = ef_eventq_poll(&vi, evs, 64);
    for (int i = 0; i < n; i++) {
        switch (EF_EVENT_TYPE(evs[i])) {
        case EF_EVENT_TYPE_RX: {
            int id = EF_EVENT_RX_RQ_ID(evs[i]);
            int len = EF_EVENT_RX_BYTES(evs[i]);
            handle_rx(pkt_buf(id), len);
            ef_vi_receive_post(&vi, dma_addr(id), id); /* repost the buffer */
            break;
        }
        case EF_EVENT_TYPE_TX: {
            ef_request_id ids[EF_VI_TRANSMIT_BATCH];
            int t = ef_vi_transmit_unbundle(&vi, &evs[i], ids); /* reclaim sent */
            for (int j = 0; j < t; j++) free_tx_buf(ids[j]);
            break;
        }
        case EF_EVENT_TYPE_RX_DISCARD: {
            int id = EF_EVENT_RX_DISCARD_RQ_ID(evs[i]);
            handle_error(&evs[i]);
            ef_vi_receive_post(&vi, dma_addr(id), id); /* a discard still returns a buffer */
            break;
        }
        case EF_EVENT_TYPE_TX_ERROR:
            handle_error(&evs[i]);
            break;
        }
    }
The correctness traps: a single TX event from ef_eventq_poll() represents potentially many completed sends, so you must call ef_vi_transmit_unbundle() to expand it into the individual request IDs (forgetting this leaks TX descriptors and the TX ring wedges). You must repost RX buffers via ef_vi_receive_post() or the RX ring drains and you silently stop receiving. Discard events (EF_EVENT_TYPE_RX_DISCARD) still own a buffer that must be reposted. And the event queue itself has finite capacity; if you stop polling, it overflows and you lose the completion stream.
- What happens if the event queue overflows?
- How large would you make the event poll batch and why?
- Why must RX_DISCARD events still repost the buffer?
Show how you would use a DPDK rte_ring to hand packets from an RX (I/O) lcore to a set of worker lcores, and explain the SPSC vs MPSC choice. [senior]
A common pipeline split (as opposed to run-to-completion, where each lcore does RX, processing, and TX itself) is one RX lcore polling the NIC and enqueuing mbuf pointers into a rte_ring, with worker lcores dequeuing. rte_ring is a lock-free bounded FIFO of pointers; the sync mode is fixed at creation via flags, and matching it to your topology avoids needless CAS contention.
    /* One RX producer, one worker consumer => single-prod / single-cons. */
    struct rte_ring *r = rte_ring_create("work", 4096, socket_id,
                                         RING_F_SP_ENQ | RING_F_SC_DEQ);

    /* RX lcore: */
    struct rte_mbuf *bufs[32];
    uint16_t nb = rte_eth_rx_burst(port, q, bufs, 32);
    unsigned sent = rte_ring_enqueue_burst(r, (void **)bufs, nb, NULL);
    if (sent < nb) /* ring full: backpressure */
        rte_pktmbuf_free_bulk(&bufs[sent], nb - sent);

    /* Worker lcore: */
    struct rte_mbuf *rx[32];
    unsigned got = rte_ring_dequeue_burst(r, (void **)rx, 32, NULL);
    for (unsigned i = 0; i < got; i++)
        process(rx[i]);
The SPSC vs MPSC decision comes down to how many lcores touch each end of the ring. If exactly one lcore enqueues and one dequeues, create with RING_F_SP_ENQ | RING_F_SC_DEQ for the cheapest path (no atomic CAS on the head/tail). If several RX lcores feed one ring, you need multi-producer mode (the default, no RING_F_SP_ENQ), which uses a CAS loop to reserve slots and is measurably slower under contention. Likewise, several workers dequeuing from one ring need multi-consumer mode (no RING_F_SC_DEQ). The classic bug is creating an SP/SC ring and then having two lcores enqueue into it, which silently corrupts the ring because the fast path assumes a single writer. The other gotcha follows from _burst semantics: it returns however many entries it could move (unlike _bulk, which is all-or-nothing), so on a full ring there is an unsent tail, and you must free those mbufs or you leak the pool.
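Why the multi-producer path costs more is clearer with the index manipulation written out. This is a simplified model of head/tail reservation in C11 atomics, loosely following the design described above; `struct ring_idx`, `sp_reserve()`, `mp_reserve()`, and `mp_commit()` are hypothetical names and this is not the rte_ring source.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

struct ring_idx {
    _Atomic uint32_t prod_head;  /* next slot a producer may reserve */
    _Atomic uint32_t prod_tail;  /* slots below this are visible to the consumer */
    _Atomic uint32_t cons_tail;  /* slots below this have been consumed */
    uint32_t capacity;
};

static uint32_t free_slots(struct ring_idx *r, uint32_t head)
{
    uint32_t used = head - atomic_load_explicit(&r->cons_tail, memory_order_acquire);
    assert(used <= r->capacity);
    return r->capacity - used;
}

/* Single producer: nobody else moves prod_head, so a plain store suffices. */
static uint32_t sp_reserve(struct ring_idx *r, uint32_t want, uint32_t *start)
{
    uint32_t head = atomic_load_explicit(&r->prod_head, memory_order_relaxed);
    uint32_t avail = free_slots(r, head);
    uint32_t n = want < avail ? want : avail;   /* burst semantics: take what fits */
    *start = head;
    atomic_store_explicit(&r->prod_head, head + n, memory_order_relaxed);
    return n;
}

/* Multi producer: retry a CAS until this thread owns [start, start + n). */
static uint32_t mp_reserve(struct ring_idx *r, uint32_t want, uint32_t *start)
{
    uint32_t head = atomic_load_explicit(&r->prod_head, memory_order_relaxed);
    uint32_t n;
    do {
        uint32_t avail = free_slots(r, head);
        n = want < avail ? want : avail;
        if (n == 0)
            return 0;
        /* On failure the CAS reloads head and the loop recomputes n. */
    } while (!atomic_compare_exchange_weak_explicit(&r->prod_head, &head, head + n,
                                                    memory_order_relaxed,
                                                    memory_order_relaxed));
    *start = head;
    return n;
}

/* Producers publish in reservation order; a later one waits for earlier ones. */
static void mp_commit(struct ring_idx *r, uint32_t start, uint32_t n)
{
    while (atomic_load_explicit(&r->prod_tail, memory_order_relaxed) != start)
        ;  /* spin: an earlier reservation has not been published yet */
    atomic_store_explicit(&r->prod_tail, start + n, memory_order_release);
}
```

The multi-producer cost has two parts: the CAS retry loop under contention, and the in-order commit that makes a fast producer wait on a slow or preempted one. Two threads calling `sp_reserve()` concurrently would both read the same head and claim the same slots, which is the silent corruption described above.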
- Why is MPSC enqueue more expensive than SPSC?
- When would you use `rte_ring_dequeue_bulk` instead of `_burst`?
- How does the ring's bounded size give you backpressure?
Why does DPDK insist on hugepages, and what is the IOVA-as-PA versus IOVA-as-VA distinction? What breaks if you get memory wrong? [staff]
Hugepages exist to control the TLB and to give the NIC physically usable addresses. A poll loop touching millions of small 4 KB pages thrashes the TLB; 2 MB or 1 GB hugepages cut TLB misses by orders of magnitude and keep packet buffers in a few translations. Equally important, DPDK needs large physically contiguous (or at least IOMMU-mappable) regions because the NIC DMAs directly into buffers using bus addresses (IOVA), and hugepages let the allocator hand out big contiguous chunks (memsegs sliced into memzones).
The two IOVA modes are about what address the NIC sees:
- IOVA-as-PA: the IOVA equals the physical address. The NIC DMAs straight to physical memory. This requires DPDK to learn physical addresses (historically via `/proc/self/pagemap`, which needs privilege) and requires buffers to be physically contiguous. No IOMMU protection.
- IOVA-as-VA: an IOMMU (VT-d/SMMU, via VFIO) translates a chosen IOVA to physical pages, so DPDK can use virtual-address-like IOVAs and physical contiguity is not required. This gives memory protection (a buggy NIC/driver cannot DMA outside its mappings) and is the safer modern default when an IOMMU is available.
What breaks if you get it wrong: too few hugepages and mempool allocation fails or you silently fall back to a tiny pool that depletes under load. Wrong NUMA socket and every packet crosses the interconnect, adding latency and IOTLB misses. In IOVA-as-PA without an IOMMU, a descriptor with a bad address can let the NIC DMA over arbitrary memory. Under IOVA-as-VA, IOTLB pressure from too many small mappings can itself become a bottleneck at high packet rates.
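The TLB argument is just a page count, and it helps to see the numbers. The sketch below computes how many translations a buffer pool spans at each page size; the helper names and the 2304-byte per-element size (a 128-byte mbuf header plus a 2176-byte buffer, ignoring mempool per-object overhead) are illustrative assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_4K (4ull << 10)
#define PAGE_2M (2ull << 20)
#define PAGE_1G (1ull << 30)

/* Total bytes occupied by the pool elements (hypothetical element size). */
static uint64_t pool_bytes(uint32_t n_mbufs, uint32_t elt_bytes)
{
    return (uint64_t)n_mbufs * elt_bytes;
}

/* Pages, and therefore distinct TLB/IOTLB translations, the pool spans. */
static uint64_t pages_needed(uint64_t bytes, uint64_t page_bytes)
{
    assert(page_bytes != 0);
    return (bytes + page_bytes - 1) / page_bytes;
}
```

A pool of 65536 elements at 2304 bytes each is 144 MiB: 36864 translations with 4 KB pages, 72 with 2 MB pages, and 1 with a 1 GB page. The same ratio applies to IOTLB entries when an IOMMU is in the path, which is why mapping granularity matters for the IOTLB follow-up below.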
- Why does IOVA-as-PA historically need elevated privilege?
- How does the IOMMU change the failure mode of a buggy descriptor?
- When can IOTLB pressure become the bottleneck?
AF_XDP zero-copy is dropping packets even though CPU utilization is below 60%. What would you inspect, and how do the four rings and need_wakeup actually work? [senior]
CPU utilization is a poor signal for a polling datapath. An AF_XDP socket has four rings around a shared UMEM: the fill ring (app hands free frames to the kernel for RX), the RX ring (kernel hands filled frames to the app), the TX ring (app hands frames to send), and the completion ring (kernel returns sent frames). RX stalls the instant the fill ring is empty; TX frames are unusable until you drain the completion ring.
I would inspect, in order:
- Is the socket actually zero-copy? Bind with `XDP_ZEROCOPY` so it errors instead of silently falling back to copy mode, and query `XDP_OPTIONS` for `XDP_OPTIONS_ZEROCOPY`.
- Are fill-ring entries available? If you stop replenishing the fill ring, the driver has no frames to DMA into and RX simply stops.
- Is the completion ring drained every loop?
- Are producer/consumer indices advanced with the correct libbpf/libxdp helpers (with the right acquire/release)?
- Is `need_wakeup` handled? With `XDP_USE_NEED_WAKEUP`, the kernel sets a flag asking you to kick it: `recvfrom()`/`poll()` wakes RX/fill processing and `sendto()`/`poll()` wakes TX. If you ignore the flag, packets sit in the rings while you spin. If you syscall on every iteration when the flag is clear, you waste cycles.
- Is the XDP program `bpf_redirect_map()`-ing to the correct `xsks_map` entry for this queue id?
- Are UMEM frame size, headroom, alignment, and MTU compatible with the packets?
- Are NIC drops isolated to one RX queue due to RSS/flow steering?
Then correlate kernel/NIC stats, XDP drop/pass/redirect counters, ring occupancy, and a timestamped poll-loop trace. Most AF_XDP loss is fill-ring starvation, an unhandled need_wakeup, or a queue/map mismatch, not raw CPU saturation.
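Fill-ring starvation is easier to reason about if you track who owns each UMEM frame. The following is a toy accounting model of the four rings, not libxdp code: `struct xsk_model` and its functions are hypothetical, and each counter stands for the number of frames currently in that ring.

```c
#include <assert.h>
#include <stdint.h>

struct xsk_model {
    uint32_t app;         /* frames the application currently owns */
    uint32_t fill;        /* handed to the kernel for RX */
    uint32_t rx;          /* filled by the kernel, awaiting the application */
    uint32_t tx;          /* queued by the application for transmit */
    uint32_t completion;  /* sent by the kernel, awaiting reclaim */
};

static uint32_t min_u32(uint32_t a, uint32_t b) { return a < b ? a : b; }

/* Move up to n frames between two owners; returns how many moved. */
static uint32_t move_frames(uint32_t *from, uint32_t *to, uint32_t n)
{
    n = min_u32(n, *from);
    *from -= n;
    *to += n;
    return n;
}

static uint32_t app_refill(struct xsk_model *m, uint32_t n)  { return move_frames(&m->app, &m->fill, n); }
static uint32_t app_forward(struct xsk_model *m, uint32_t n) { return move_frames(&m->rx, &m->tx, n); }
static uint32_t kernel_tx(struct xsk_model *m, uint32_t n)   { return move_frames(&m->tx, &m->completion, n); }
static uint32_t app_reclaim(struct xsk_model *m, uint32_t n) { return move_frames(&m->completion, &m->app, n); }

/* Each arriving packet needs one fill-ring frame; returns the number dropped. */
static uint32_t kernel_rx(struct xsk_model *m, uint32_t arrivals)
{
    return arrivals - move_frames(&m->fill, &m->rx, arrivals);
}

/* Frames are conserved: anything else means a leak or a double post. */
static void check_conservation(const struct xsk_model *m, uint32_t umem_frames)
{
    assert(m->app + m->fill + m->rx + m->tx + m->completion == umem_frames);
}
```

In this model a forwarder that never calls `app_reclaim()` ends up with every frame parked in the completion ring, so `app_refill()` has nothing to post and `kernel_rx()` drops every arrival while the CPU looks idle. That is the symptom in the question.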
- What does `XDP_SHARED_UMEM` change?
- How do copy mode and zero-copy mode alter the bottleneck?
- Where can packets be dropped before reaching the XSK?
How does io_uring for networking compare with kernel bypass, especially with multishot receive and zero-copy send/receive? [senior]
io_uring is not classic kernel bypass. It keeps full kernel socket and TCP semantics, but cuts syscall and readiness-notification overhead through shared submission/completion rings and a completion-oriented model. Networking can use multishot receive, provided buffers, registered files/buffers, and zero-copy send via io_uring_prep_send_zc().
Multishot receive (io_uring_prep_recv_multishot) posts one SQE that keeps producing CQEs as data arrives, each consuming a buffer from a provided buffer group. The signal that the request is still armed is IORING_CQE_F_MORE in the CQE flags; when it is absent the multishot has terminated and must be rearmed. This removes one SQE per receive.
Zero-copy send has the same lifetime rule as MSG_ZEROCOPY: the operation generates a notification CQE, and the user buffer must not be reused until that notification fires, which is after (and separate from) the CQE reporting bytes sent. Treating send acceptance as buffer-free is the classic corruption bug.
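The two-CQE lifetime rule can be written as a small state machine. This sketch avoids a liburing dependency by redefining the two flag bits locally (they are meant to mirror `IORING_CQE_F_MORE` and `IORING_CQE_F_NOTIF` from `<linux/io_uring.h>`); `struct zc_send` and `zc_on_cqe()` are hypothetical names for per-request tracking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local copies so the sketch builds without kernel headers (assumed values). */
#define SKETCH_CQE_F_MORE  (1u << 1)
#define SKETCH_CQE_F_NOTIF (1u << 3)

enum zc_state { ZC_IN_FLIGHT, ZC_AWAIT_NOTIF, ZC_BUFFER_FREE };

struct zc_send {
    enum zc_state state;
    int32_t result;       /* bytes sent or -errno from the first CQE */
};

/* Feed each CQE for this request; returns true once the buffer may be reused. */
static bool zc_on_cqe(struct zc_send *s, int32_t res, uint32_t flags)
{
    if (flags & SKETCH_CQE_F_NOTIF) {
        /* Notification CQE: the kernel no longer references the user buffer. */
        assert(s->state == ZC_AWAIT_NOTIF);
        s->state = ZC_BUFFER_FREE;
        return true;
    }
    /* Result CQE: reports bytes sent, but the buffer may still be in use. */
    assert(s->state == ZC_IN_FLIGHT);
    s->result = res;
    s->state = (flags & SKETCH_CQE_F_MORE) ? ZC_AWAIT_NOTIF : ZC_BUFFER_FREE;
    return s->state == ZC_BUFFER_FREE;
}
```

The branch without the MORE flag covers a send that fails before any notification is armed, in which case the first CQE is also the last. Returning the buffer to a pool on the result CQE when the MORE flag is set is exactly the corruption bug described above.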
Zero-copy receive via io_uring is newer and aims to remove the kernel-to-user payload copy while the kernel still processes headers and TCP state. That places io_uring between sockets and DPDK: more Linux semantics than bypass, less packet-level control than DPDK or ef_vi.
I would choose io_uring when I need many concurrent sockets, kernel TCP behavior, fewer syscalls, and a unified async model. I would choose bypass when I need raw packet control, hardware queue ownership, deterministic polling, or to eliminate kernel TCP/socket overhead entirely.
- What does `IORING_CQE_F_MORE` mean for multishot receive?
- Why might epoll still match or beat io_uring for simple networking?
- How would you feature-probe zero-copy support?
Once syscalls are gone, why can the PCIe/cache subsystem (DDIO, IOTLB, NUMA) dominate latency, and how would you reason about it? [staff]
When the software path is a tight poll loop, the remaining variable cost is moving bytes between the NIC, memory, and the CPU's caches. On Intel, DDIO (Data Direct I/O) lets inbound DMA write packet data straight into the last-level cache instead of DRAM, so the core finds the packet warm. That is a latency win, but DDIO is restricted to a limited number of LLC ways (historically about 2 of the LLC's N ways). If the working set of in-flight buffers exceeds what those ways hold, you hit the leaky-DMA / write-allocate problem: the NIC's writes force evictions and writebacks to DRAM, and the core then misses to DRAM on packet access, so DDIO silently stops helping under load or with a noisy LLC neighbor.
The other costs:
- IOTLB: with an IOMMU, each DMA address must be translated; too many distinct mappings (large buffer pools, scattered frames) cause IOTLB misses that add latency at high packet rates.
- NUMA: if the NIC, its queues, the mempool, and the polling core are not on the same socket, every packet crosses the interconnect, costing latency and remote-memory bandwidth.
- PCIe round trips: a TX doorbell that makes the NIC read descriptors from memory is a PCIe read latency; this is exactly what CTPIO/TX_PUSH eliminate by pushing data to the NIC.
Reasoning approach: pin NIC, queues, memory, and core to one NUMA node; size buffer pools so the hot set fits DDIO ways; watch LLC miss rate and remote-DRAM bandwidth (PMU/uncore counters), IOTLB miss counters, and whether tuning ring/pool size changes the LLC footprint. The tell of a DDIO problem is that latency degrades as pool size or co-tenant LLC pressure grows even though packet rate is constant.
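A back-of-the-envelope check for "does the hot set fit the DDIO ways" might look like the sketch below. It assumes the LLC is divided evenly across ways and that every posted RX buffer is part of the hot set, both of which are simplifications; the function names and the example cache geometry are hypothetical and should be replaced with values read from the actual platform.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* LLC capacity reachable by inbound DMA if DDIO may use ddio_ways of llc_ways. */
static uint64_t ddio_capacity_bytes(uint64_t llc_bytes, uint32_t llc_ways,
                                    uint32_t ddio_ways)
{
    assert(llc_ways > 0 && ddio_ways <= llc_ways);
    return llc_bytes / llc_ways * ddio_ways;
}

/* Pessimistic hot set: every posted RX buffer on every queue. */
static uint64_t rx_hot_set_bytes(uint32_t rx_queues, uint32_t rx_desc,
                                 uint32_t buf_bytes)
{
    return (uint64_t)rx_queues * rx_desc * buf_bytes;
}

static bool fits_ddio(uint64_t hot_set, uint64_t ddio_capacity)
{
    return hot_set <= ddio_capacity;
}
```

With a 22 MiB, 11-way LLC and 2 DDIO ways, the DMA-reachable slice is 4 MiB. One queue of 1024 descriptors with 2 KiB buffers (2 MiB) fits, while four such queues (8 MiB) do not, which matches the observation that growing rings or pools can raise latency at constant packet rate.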
- How would you size buffer pools to stay within DDIO ways?
- Which PMU/uncore counters would you watch?
- How does CTPIO remove a PCIe read round trip?
You need to choose RX ring size, burst size, and poll strategy for a sub-10 microsecond target. What tradeoffs do you make? [staff]
Ring size is burst tolerance versus queueing and cache footprint. A larger RX ring absorbs scheduler hiccups or incast, but it also lets more packets wait before the app notices overload, and a bigger ring spills out of cache. For ultra-low latency I want enough descriptors to survive known bursts and refill latency, not so many that they hide a broken consumer.
Burst size is amortization versus head-of-line waiting. Larger bursts cut per-packet overhead, doorbells, and call cost. Smaller bursts cut the time spent draining one queue before checking timers, control messages, or a higher-priority queue. A common low-latency design uses bounded bursts, prefetches ahead, and explicitly budgets each loop iteration.
Poll strategy is latency versus power and CPU isolation. Pure busy-polling avoids interrupt and scheduler latency but burns a core and can starve colocated work. Interrupts save CPU but add wakeup latency and jitter. Hybrid (poll then arm an interrupt) needs careful thresholds; the mode switch itself can create tails.
The answer must be measurement-driven: plot ring occupancy, loop time, packets per poll, empty-poll rate, p50/p99/p99.99 latency, drops, and core C-state/IRQ noise. For a sub-10 us target, BIOS power settings (disable C-states/P-state transitions), core isolation (isolcpus/nohz_full), NUMA locality, IRQ affinity, PCIe locality, and NIC interrupt moderation are part of the datapath, not afterthoughts.
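Before measuring, it is worth doing the arithmetic that bounds the design space. The sketch below relates ring depth to queueing delay and burst tolerance under a constant per-packet service time, which is an idealization; the function names and the example numbers are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Time a packet waits if it lands behind a completely full ring. */
static uint64_t full_ring_wait_ns(uint32_t ring_slots, uint32_t service_ns)
{
    return (uint64_t)ring_slots * service_ns;
}

/* Longest consumer stall an initially empty ring absorbs before dropping. */
static uint64_t stall_budget_ns(uint32_t ring_slots, uint64_t arrival_pps)
{
    assert(arrival_pps > 0);
    return (uint64_t)ring_slots * 1000000000ull / arrival_pps;
}

/* Largest backlog whose drain time still fits the latency budget. */
static uint32_t max_backlog_for_budget(uint64_t budget_ns, uint32_t service_ns)
{
    assert(service_ns > 0);
    return (uint32_t)(budget_ns / service_ns);
}
```

At 100 ns per packet, a full 4096-entry ring means about 410 us of queueing, far outside a 10 us target, and the budget tolerates roughly 100 queued packets. A 512-entry ring at 10 Mpps still absorbs a consumer stall of about 51 us. The ring is therefore sized for loss tolerance, and occupancy has to be monitored separately to protect latency.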
- How would you set interrupt moderation for this path?
- When does busy polling make p99 worse?
- How would you test incast without fooling yourself?