🧬 Linux Kernel Network Datapath
Probes whether a senior engineer can reason about Linux packet paths as a set of queues, cache lines, timers, memory ownership transitions, and observability tradeoffs.
Burst packets: one interrupt, a budgeted batch drain, then re-arm the IRQ.
A low-latency UDP workload shows p99 spikes only under bursty receive load, and perf points at `NET_RX` softirq. Walk through how NAPI changes the interrupt path and what you would tune or instrument before blaming the NIC. [senior]
NAPI turns an interrupt-per-event model into interrupt-to-schedule plus polling. The ISR typically masks or acknowledges the queue interrupt, calls napi_schedule() (or __napi_schedule() after napi_schedule_prep() if it masks IRQs explicitly), and packet/completion work runs from net_rx_action() in softirq context or from threaded NAPI if configured. The poll function receives a budget that limits Rx work only: Tx completions can be reclaimed regardless of it, but Rx, XDP, and page_pool processing must stay within it. The special case is budget 0 (for example when netpoll invokes the poll function), where the driver may reclaim Tx completions but must not touch Rx, XDP, or page_pool at all.
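The budget contract above can be sketched as a small userspace model. This is only an illustration of the control flow, not kernel code: `struct fake_queue`, `fake_isr()`, and `fake_poll()` are hypothetical names, and the re-arm step stands in for what a real driver does after `napi_complete_done()` returns true.

```c
#include <stdbool.h>

/* Hypothetical userspace model of one Rx/Tx queue pair; not a kernel type. */
struct fake_queue {
    int  rx_pending;   /* packets waiting in the Rx ring */
    int  tx_done;      /* Tx completions waiting to be reclaimed */
    bool irq_enabled;  /* true while the queue interrupt is armed */
};

/* Models the ISR: mask the interrupt and leave the work to the poller. */
void fake_isr(struct fake_queue *q)
{
    q->irq_enabled = false;
}

/*
 * Models a NAPI poll callback and returns the Rx work done.
 * Tx completions are reclaimed regardless of the budget. Rx is touched
 * only when budget > 0. The interrupt is re-armed only when the ring
 * drained with budget to spare (work < budget).
 */
int fake_poll(struct fake_queue *q, int budget)
{
    int work = 0;

    q->tx_done = 0;                 /* Tx reclaim is not Rx-budgeted */

    if (budget <= 0)
        return 0;                   /* budget 0: no Rx work, no re-arm */

    while (work < budget && q->rx_pending > 0) {
        q->rx_pending--;            /* "process" one packet */
        work++;
    }

    if (work < budget)
        q->irq_enabled = true;      /* models completing NAPI, then unmasking */

    return work;                    /* work == budget means "poll me again" */
}
```

With 100 packets pending and a budget of 64, the first call returns 64 and leaves the interrupt masked; the second call drains the remaining 36 and re-arms it.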
I would check whether latency is caused by too much work per softirq or not enough polling before reinterrupting:
- /proc/net/softnet_stat columns for backlog drops (col 2), time_squeeze (col 3, polls that hit netdev_budget/netdev_budget_usecs and yielded), and per-CPU imbalance.
- perf top, perf sched, trace-cmd or ftrace events around napi_poll, IRQ entry, and netif_receive_skb.
- Queue-to-CPU affinity, MSI-X vector affinity, RSS indirection, RPS/RFS, and whether ksoftirqd is taking over (which happens when a softirq round exceeds its budget and reschedules).
- net.core.netdev_budget, netdev_budget_usecs, dev_weight, IRQ coalescing, and driver NAPI weight (commonly 64).
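As a rough aid for the first item, the sketch below reads the first three hexadecimal columns of each softnet_stat row, using the column meanings given above (processed, dropped, time_squeeze). The struct and function names are hypothetical, and the row index is printed as a row number rather than a CPU id, since rows may not map one-to-one to CPU numbers.

```c
#include <stdio.h>

/* Hypothetical holder for the first three softnet_stat columns. */
struct softnet_row {
    unsigned int processed;     /* col 1 */
    unsigned int dropped;       /* col 2: backlog drops */
    unsigned int time_squeeze;  /* col 3: polls cut short by budget/time */
};

/* Parses one line of hex fields; returns 0 on success, -1 otherwise. */
int parse_softnet_line(const char *line, struct softnet_row *row)
{
    int n = sscanf(line, "%x %x %x",
                   &row->processed, &row->dropped, &row->time_squeeze);
    return n == 3 ? 0 : -1;
}

/* Prints every parsable row; returns rows read, or -1 if open fails. */
int dump_softnet(const char *path)
{
    FILE *f = fopen(path, "r");
    char line[512];
    int rows = 0;

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        struct softnet_row row;

        if (parse_softnet_line(line, &row) == 0)
            printf("row %d: processed=%u dropped=%u time_squeeze=%u\n",
                   rows, row.processed, row.dropped, row.time_squeeze);
        rows++;
    }
    fclose(f);
    return rows;
}
```

Calling `dump_softnet("/proc/net/softnet_stat")` twice, a few seconds apart, and diffing the output is usually more informative than a single snapshot, because the counters are cumulative.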
For ultra-low latency I would test smaller coalescing and possibly SO_BUSY_POLL/SO_PREFER_BUSY_POLL, but only with CPU isolation and power-state control. Busy polling can reduce wakeup latency while increasing CPU burn and cache interference. The real trap is a microburst arriving while ksoftirqd is descheduled behind the application on the same CPU.
- What does `napi_complete_done()` return, and what must the driver not do if it returns false?
- When does ksoftirqd enter the picture, and why can it lose to the app on the same core?
- Why can a budget of 0 surprise an XDP-capable driver?
Explain the `sk_buff` data layout a driver must respect on receive and transmit: headroom, tailroom, linear data, frags, and ownership. What bugs appear only at high packet rate? [senior]
sk_buff is metadata plus a data area described by head, data, tail, and end, with optional page fragments in skb_shinfo(skb)->frags and a possible frag_list for chained skbs. Headroom (skb_headroom()) is space before data for pushing headers; tailroom (skb_tailroom()) is space after tail for appending. Drivers must account for required alignment, NET_IP_ALIGN padding (2 bytes on architectures that want it, 0 on x86) so that the IP header following the 14-byte Ethernet header lands at offset 16 and is naturally aligned, checksum metadata, VLAN tags, and whether data is linear or paged. skb->len is total bytes; skb->data_len is the paged portion; skb_headlen() is the linear part.
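The four-pointer layout can be illustrated with a small userspace model. The names below (`struct pktbuf`, `pktbuf_*`) are hypothetical and only mirror the arithmetic of skb_reserve(), skb_put(), skb_push(), and skb_pull(); unlike this sketch, the kernel helpers treat overruns as fatal bugs rather than returning NULL.

```c
#include <stddef.h>

/* Hypothetical model of the sk_buff data pointers; not the kernel struct. */
struct pktbuf {
    unsigned char *head;  /* start of the allocation */
    unsigned char *data;  /* start of valid packet bytes */
    unsigned char *tail;  /* end of valid packet bytes */
    unsigned char *end;   /* end of the allocation */
};

void pktbuf_init(struct pktbuf *b, unsigned char *mem, size_t size)
{
    b->head = b->data = b->tail = mem;
    b->end = mem + size;
}

size_t pktbuf_headroom(const struct pktbuf *b) { return (size_t)(b->data - b->head); }
size_t pktbuf_tailroom(const struct pktbuf *b) { return (size_t)(b->end - b->tail); }
size_t pktbuf_len(const struct pktbuf *b)      { return (size_t)(b->tail - b->data); }

/* Like skb_reserve(): create headroom; only valid on an empty buffer. */
int pktbuf_reserve(struct pktbuf *b, size_t n)
{
    if (pktbuf_len(b) != 0 || n > pktbuf_tailroom(b))
        return -1;
    b->data += n;
    b->tail += n;
    return 0;
}

/* Like skb_put(): append n bytes; returns the start of the new area. */
unsigned char *pktbuf_put(struct pktbuf *b, size_t n)
{
    unsigned char *old_tail = b->tail;

    if (n > pktbuf_tailroom(b))
        return NULL;
    b->tail += n;
    return old_tail;
}

/* Like skb_push(): prepend n bytes by moving data into the headroom. */
unsigned char *pktbuf_push(struct pktbuf *b, size_t n)
{
    if (n > pktbuf_headroom(b))
        return NULL;
    b->data -= n;
    return b->data;
}

/* Like skb_pull(): strip n bytes from the front of the packet. */
unsigned char *pktbuf_pull(struct pktbuf *b, size_t n)
{
    if (n > pktbuf_len(b))
        return NULL;
    b->data += n;
    return b->data;
}
```

Reserving 64 bytes in a 256-byte buffer, appending a 100-byte payload, and then pushing a 14-byte Ethernet header leaves 50 bytes of headroom and 92 bytes of tailroom; a fast path that assumes more than that is the first bug class listed below.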
On Rx, a driver often builds an skb around a DMA-filled page or page fragment (build_skb()/napi_build_skb()), reserves headroom, pulls enough header into the linear area, sets skb->protocol, checksum, and hash metadata, then passes it to GRO or the stack. On Tx, ndo_start_xmit() must handle SG fragments, DMA-map every segment, and must not modify shared parts of a cloned skb. Once it returns NETDEV_TX_OK, the driver owns eventual completion and freeing.
High-rate bugs include:
- Writing before skb_headroom() or after skb_tailroom() because a fast path assumes linear room.
- Forgetting skb_cow_head() before pushing headers on a possibly-cloned skb (skb_cloned() true), corrupting another reference's data.
- Leaking skbs when Tx completions are interrupt-coalesced but no later interrupt arrives to reclaim them.
- Mapping too many fragments for the ring and returning NETDEV_TX_BUSY after partially taking ownership.
- Cacheline contention on skb metadata and false sharing in per-queue stats.
- When would you use `napi_build_skb()` rather than `build_skb()`?
- How do checksum offload flags (`CHECKSUM_PARTIAL` vs `CHECKSUM_UNNECESSARY`) alter what the stack expects?
- What does a driver do if the skb has more frags than the NIC's descriptor count supports?
GRO, GSO, TSO, and LRO all reduce per-packet overhead. How do you choose between them for latency-sensitive TCP, and what failure modes do they introduce? [senior]
GSO is software segmentation late in the transmit path (often just before the driver, or in the driver via skb_gso_segment() if the NIC can't offload); TSO lets the NIC segment a large skb described by gso_size/gso_segs. GRO merges received packets in the kernel before protocol processing via the NAPI GRO path. LRO is hardware or driver aggregation and is more dangerous because it can break forwarding, packet visibility, and protocol assumptions if it coalesces packets that should stay distinct. The kernel force-disables LRO when a device is enslaved to a bridge or used for routing, whereas GRO is reversible: it merges only packets that GSO can later resegment into the original frames, so a GRO'd skb can still be forwarded.
For latency-sensitive TCP I do not blindly disable everything. TSO/GSO reduce CPU and PCIe descriptor pressure on large sends but can create head-of-line effects if a huge skb monopolizes a queue. GRO reduces receive CPU cost but can delay delivery until a flow, a gro_flush_timeout, or a NAPI/budget boundary flushes. On tiny-message trading traffic the aggregation benefit is low and the batching hurts tail latency. On AI-cluster control or storage traffic the throughput benefit is usually worth it.
Failure modes:
- GRO hides microbursts from socket-level timestamps and from tcpdump (the AF_PACKET tap sits above GRO, so you see merged super-segments, not wire packets).
- Large TSO skbs can inflate BQL/accounting and delay small control packets unless queueing separates them.
- LRO can be invalid for routers/bridges and can interfere with IPsec, tunneling, or classifiers.
- Checksum, MSS, tunnel-offload, or VLAN metadata bugs often appear only with mixed encapsulation.
I compare with ethtool -k/-K, NIC counters, tc -s, hardware timestamps if available, and captures at the right layer.
- Why does GRO make `tcpdump` show packets larger than the MTU?
- How does TSO interact with qdisc and BQL byte accounting?
- What would you disable first when debugging checksum corruption, and why GRO before TSO?
You have 64 Rx queues on a 64-core system, but one queue is hot and drops while the others are idle. Explain RSS, RPS, RFS, XPS, and flow steering choices, including when software steering makes things worse. [staff]
RSS is hardware queue selection: the NIC hashes headers such as the 4-tuple (typically with Toeplitz) and uses the low bits of the hash to index an indirection table that names the Rx queue. It is cheapest because packets land on the intended Rx queue and MSI-X vector directly. If one elephant flow dominates, a hash cannot split a single ordered flow across queues without reordering; you need application sharding, protocol changes, hardware flow-director rules, or accepting one hot queue.
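For intuition, here is a minimal software Toeplitz hash plus an indirection-table lookup. It is a sketch under assumptions: the function names are hypothetical, the 40-byte key is an example value reproduced from commonly published RSS verification material (check it against your NIC's configured key, for instance via `ethtool -x`), and real hardware may mask with the table size rather than use a modulo.

```c
#include <stddef.h>
#include <stdint.h>

/* Example 40-byte RSS key (assumed value; substitute the NIC's real key). */
const uint8_t example_rss_key[40] = {
    0x6d, 0x5a, 0x56, 0xda, 0x25, 0x5b, 0x0e, 0xc2,
    0x41, 0x67, 0x25, 0x3d, 0x43, 0xa3, 0x8f, 0xb0,
    0xd0, 0xca, 0x2b, 0xcb, 0xae, 0x7b, 0x30, 0xb4,
    0x77, 0xcb, 0x2d, 0xa3, 0x80, 0x30, 0xf2, 0x0c,
    0x6a, 0x42, 0xb7, 0x3b, 0xbe, 0xac, 0x01, 0xfa,
};

/*
 * Toeplitz hash: for every set bit of the input (most significant bit
 * first), XOR in the 32-bit window of the key that starts at that bit
 * position. The key must be at least len + 4 bytes long.
 */
uint32_t toeplitz_hash(const uint8_t *key, const uint8_t *data, size_t len)
{
    uint32_t window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
                      ((uint32_t)key[2] << 8)  |  (uint32_t)key[3];
    uint32_t hash = 0;

    for (size_t i = 0; i < len; i++) {
        for (int bit = 7; bit >= 0; bit--) {
            if (data[i] & (1u << bit))
                hash ^= window;
            /* Slide the window left by one bit, pulling in the next key bit. */
            window = (window << 1) | ((uint32_t)(key[i + 4] >> bit) & 1u);
        }
    }
    return hash;
}

/* Maps a hash to an Rx queue through an indirection table. */
unsigned int rss_queue(uint32_t hash, const uint8_t *indir, size_t indir_size)
{
    return indir[hash % indir_size];
}
```

The input for TCP over IPv4 is conventionally the 12 bytes source address, destination address, source port, destination port in network byte order. Because the hash is an XOR of key windows, it is linear over XOR, which is also why a single flow always lands on one queue no matter how the table is rebalanced around it.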
RPS steers receive processing in software after the interrupting CPU has already paid DMA completion and initial driver cost; it enqueues to another CPU's backlog and fires an IPI. It can help when hardware queues are scarce, but it adds IPIs, cache movement, and queueing. RFS steers packets toward the CPU where the consuming socket last ran, using rps_sock_flow_table plus per-queue rps_flow_cnt. Accelerated RFS pushes that decision into hardware: the stack calls the driver's ndo_rx_flow_steer() to program a filter, and the driver periodically calls rps_may_expire_flow() to learn when a filter can be reclaimed. XPS chooses Tx queues by CPU or flow mapping to preserve locality and cut Tx lock/cache contention.
I inspect ethtool -x (indirection table), ethtool -n/-N (ntuple filters), /sys/class/net/.../queues/rx-*/rps_cpus, rps_flow_cnt, IRQ affinity, NUMA locality, and app CPU placement. For ultra-low latency I prefer hardware RSS plus explicit IRQ/app pinning over broad RPS masks, because software steering can turn one hot queue into cross-CPU cache misses and IPI storms, which is strictly more work than the original problem.
- How does symmetric (Toeplitz) RSS keep both directions of a flow on one queue?
- What is the risk of moving both directions of a TCP flow to one queue?
- How would you debug a wrong RSS hash for VXLAN/encapsulated traffic?
Compare XDP native mode, generic XDP, TC ingress, and AF_XDP for dropping, redirecting, and userspace delivery. Where are the sharp edges? [staff]
Native XDP runs in the driver receive path before skb allocation, so XDP_DROP, XDP_TX, and XDP_REDIRECT can avoid much of the stack. Generic XDP (XDP_FLAG_SKB_MODE) runs later using skb infrastructure, so it is portable but loses the main early-drop advantage. TC ingress (sch_clsact) runs after skb creation and has richer integration with the qdisc/classifier/action machinery, but it pays skb-alloc cost.
AF_XDP uses an XDP program, usually redirecting through an XSKMAP, to deliver frames to userspace UMEM. The socket has Rx/Tx rings and the UMEM has Fill/Completion rings. Those rings are single-producer/single-consumer for performance, so sharing a ring across threads without external serialization is a bug. Zero-copy depends on driver support; copy mode still works but changes the performance model.
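The single-producer/single-consumer discipline can be sketched with C11 atomics. This is a generic SPSC ring in the same spirit as the AF_XDP rings, not the actual xsk ring layout or the libxdp API; `struct spsc_ring`, `ring_produce()`, and `ring_consume()` are hypothetical names. Each index has exactly one writer, which is what makes the lock-free version correct and what breaks when two threads share one side.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 8u  /* must be a power of two */

/* Hypothetical SPSC descriptor ring; indices are free-running counters. */
struct spsc_ring {
    _Atomic uint32_t producer;  /* written only by the producer */
    _Atomic uint32_t consumer;  /* written only by the consumer */
    uint64_t desc[RING_SIZE];   /* e.g. UMEM frame addresses */
};

/* Producer side: returns false when the ring is full. */
bool ring_produce(struct spsc_ring *r, uint64_t addr)
{
    uint32_t prod = atomic_load_explicit(&r->producer, memory_order_relaxed);
    uint32_t cons = atomic_load_explicit(&r->consumer, memory_order_acquire);

    if (prod - cons == RING_SIZE)
        return false;
    r->desc[prod & (RING_SIZE - 1)] = addr;
    /* Release: the slot write must be visible before the new index. */
    atomic_store_explicit(&r->producer, prod + 1, memory_order_release);
    return true;
}

/* Consumer side: returns false when the ring is empty. */
bool ring_consume(struct spsc_ring *r, uint64_t *addr)
{
    uint32_t cons = atomic_load_explicit(&r->consumer, memory_order_relaxed);
    uint32_t prod = atomic_load_explicit(&r->producer, memory_order_acquire);

    if (prod == cons)
        return false;
    *addr = r->desc[cons & (RING_SIZE - 1)];
    /* Release: the slot read must finish before the slot is handed back. */
    atomic_store_explicit(&r->consumer, cons + 1, memory_order_release);
    return true;
}
```

If two threads both called `ring_produce()`, they could read the same `prod`, write the same slot, and publish the same index, silently losing a descriptor. That is the failure that external serialization (or one ring per thread) avoids.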
Sharp edges:
- XDP_REDIRECT returns to the program immediately but the actual transfer is batched; the driver must call xdp_do_flush() once at the end of the NAPI poll or redirected frames stall. Core helpers usually handle this.
- Redirect can drop silently if the map entry is empty or the target device/queue isn't bound.
- Headroom and metadata requirements differ between XDP (XDP_PACKET_HEADROOM) and skb paths.
- AF_XDP Fill-ring starvation looks like NIC packet loss even though the NIC is healthy: the driver has no buffers to post.
- Program complexity is bounded by the verifier; multi-buffer (jumbo) frames need bpf_xdp_load_bytes()/bpf_xdp_adjust_* rather than direct access past the linear segment.
A minimal classifier bounds-checks every access:
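#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("xdp")
int drop_non_ipv4(struct xdp_md *ctx)
{
    void *data = (void *)(long)ctx->data;
    void *end = (void *)(long)ctx->data_end;
    struct ethhdr *eth = data;

    /* The verifier rejects any packet load not covered by a bounds check. */
    if ((void *)(eth + 1) > end)
        return XDP_ABORTED;  /* runt frame; also fires the xdp_exception tracepoint */
    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_DROP;     /* anything that is not IPv4 */
    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";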
- When would TC ingress be preferable to native XDP?
- What counters show AF_XDP Fill-ring starvation?
- Why does redirect to an empty XSKMAP entry drop instead of erroring?
What really happens between `ndo_start_xmit()`, qdisc, BQL, and a multiqueue NIC? Describe how a driver should stop and wake queues without creating latency cliffs. [senior]
The core selects a transmit queue (via ndo_select_queue() or netdev_pick_tx()/XPS), runs the skb through that queue's qdisc, and eventually calls ndo_start_xmit() with the per-queue Tx lock held (HARD_TX_LOCK(), unless the driver is lockless) and only while the queue is not stopped, that is, while __QUEUE_STATE_DRV_XOFF is clear. On multiqueue devices the driver uses netdev_get_tx_queue() state and per-queue rings. The driver must enqueue the skb to hardware and return NETDEV_TX_OK, or return NETDEV_TX_BUSY only as an exceptional path without keeping or freeing the skb. The softnet guidance treats routine NETDEV_TX_BUSY as a driver queue-management bug.
A correct driver stops a subqueue before it runs out of descriptors, using a threshold that accounts for the worst-case skb (MAX_SKB_FRAGS + 1 plus any context/metadata descriptors). On completion it reclaims descriptors, frees skbs, updates BQL with netdev_tx_sent_queue() on enqueue and netdev_tx_completed_queue() on completion, and calls netif_tx_wake_queue() when enough room exists. Wake too early and you churn BUSY; wake too late and you add artificial latency.
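The threshold logic can be sketched in userspace as follows. All names are hypothetical, `EX_MAX_SKB_FRAGS` is an assumed value standing in for the kernel's MAX_SKB_FRAGS, and the "wake at twice the worst case" rule is one illustrative hysteresis choice rather than a kernel requirement. Barriers are deliberately omitted here because the model is single-threaded.

```c
#include <stdbool.h>

#define EX_MAX_SKB_FRAGS 17u  /* assumption: stands in for MAX_SKB_FRAGS */
#define EX_CTX_DESC       1u  /* hypothetical per-packet context descriptor */

/* Worst-case descriptors for one skb: linear part + every frag + context. */
unsigned int tx_desc_needed(void)
{
    return EX_MAX_SKB_FRAGS + 1u + EX_CTX_DESC;
}

/* Hypothetical model of one Tx queue. */
struct txq_model {
    unsigned int ring_size;
    unsigned int in_flight;  /* descriptors owned by hardware */
    bool stopped;            /* models the queue's stopped state */
};

unsigned int txq_avail(const struct txq_model *q)
{
    return q->ring_size - q->in_flight;
}

/*
 * Xmit: post ndesc descriptors, then stop the queue if the next
 * worst-case skb might not fit. Returning -1 models NETDEV_TX_BUSY,
 * which a correct stop threshold makes unreachable.
 */
int txq_xmit(struct txq_model *q, unsigned int ndesc)
{
    if (q->stopped || ndesc > txq_avail(q))
        return -1;
    q->in_flight += ndesc;
    if (txq_avail(q) < tx_desc_needed())
        q->stopped = true;
    return 0;
}

/* Completion: reclaim, then wake only once there is comfortable room. */
void txq_complete(struct txq_model *q, unsigned int ndesc)
{
    if (ndesc > q->in_flight)
        ndesc = q->in_flight;
    q->in_flight -= ndesc;
    if (q->stopped && txq_avail(q) >= 2u * tx_desc_needed())
        q->stopped = false;
}
```

With a 64-entry ring and a worst case of 19 descriptors, the queue stops after the third maximal skb (7 free), stays stopped after the first completion (26 free), and wakes after the second (45 free). Waking at 19 instead of 38 would wake sooner but stop again after a single packet.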
Latency cliffs appear when one large TSO skb fills descriptors, Tx completions are delayed by coalescing, BQL limits are stale, or all traffic shares one qdisc class. I inspect tc -s qdisc, ethtool -S, BQL sysfs under queues/tx-*/byte_queue_limits, and per-queue stop/wake counters.
- Why is `NETDEV_TX_BUSY` after DMA-mapping fragments especially bad?
- How can a single TSO skb consume many descriptors?
- What does BQL optimize (bufferbloat in the ring), and what does it not (qdisc latency)?
Stopping a Tx queue when the ring is full has a notorious SMP race against the completion path that wakes it. Walk through the race and the barrier pattern that fixes it. [staff]
The race is a lost-wakeup. The xmit path on CPU A fills the last descriptor and decides to stop the queue; the completion path on CPU B frees descriptors and decides whether to wake it. Naively:
    /* CPU A: xmit */
    if (ring_full(ring))
        netif_tx_stop_queue(txq);

    /* CPU B: completion, concurrently */
    if (desc_freed && netif_tx_queue_stopped(txq))
        netif_tx_wake_queue(txq);
Interleaving can lose the wake: A sees the ring full, B frees everything and observes the queue not yet stopped (so skips wake), then A stops it. The queue is now stopped with a fully drained ring, so no further completion will arrive to wake it. The qdisc holds packets and the link goes quiet despite a healthy NIC, at least until the Tx watchdog times out.
The fix is the stop, then re-check, then wake-if-now-empty pattern with a full barrier, because stop (a producer-side store) must be globally visible before the producer re-reads the consumer index:
    /* CPU A: xmit, after posting descriptors */
    if (unlikely(desc_avail(ring) < NEEDED)) {
        netif_tx_stop_queue(txq);
        smp_mb();   /* order the stop against the re-read below */
        if (unlikely(desc_avail(ring) >= NEEDED))
            netif_tx_start_queue(txq);  /* completion freed room meanwhile */
    }
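netif_tx_stop_queue() is a set_bit() on the queue state, and set_bit() does not imply a memory barrier, so the xmit side must supply one explicitly (smp_mb(), or smp_mb__after_atomic() directly after the stop). The completion side needs the mirror image: publish the new consumer index, execute smp_mb(), then test netif_tx_queue_stopped() and wake if there is room. Recent kernels package both halves as helper macros in include/net/netdev_queues.h (netif_txq_maybe_stop() and netif_txq_completed_wake()), which drivers should prefer over open-coding. The underlying lesson: the stop bit and the descriptor index are two independent memory locations updated on different CPUs, and each side does a store followed by a load of the other location, so you need a full barrier on each side, not just a compiler barrier or smp_wmb().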
- Why must it be `smp_mb()` and not `smp_wmb()` on the xmit side?
- How does BQL's `netdev_tx_completed_queue()` have the same lost-wakeup shape?
- Why is per-queue locking not enough to remove the barrier?
Interrupt coalescing improves throughput but can destroy p99 latency. How would you design an adaptive coalescing policy and validate it for a NIC used in both low-latency trading and AI-cluster traffic? [staff]
Coalescing trades interrupt rate for batching. The ethtool controls include Rx/Tx usecs, max frames, IRQ-specific variants, adaptive Rx/Tx flags, and driver-specific profiles. A low-latency profile uses very small or zero Rx usecs and leans on NAPI/busy polling; a throughput profile accepts tens of microseconds to cut CPU and PCIe pressure.
A modern alternative to per-packet IRQs is deferred hard IRQs: set napi_defer_hard_irqs (consecutive empty polls to tolerate before re-arming hardware IRQs) together with gro_flush_timeout (the kernel re-polls via a timer instead of taking an interrupt). With SO_BUSY_POLL plus an irq-suspend-timeout, an app can suppress hardware IRQs entirely while it is actively polling and fall back to timer-driven polling when idle, which often gives better tail latency than tuning coalescing usecs alone.
An adaptive policy should not react to instantaneous packet rate. It should weigh queue occupancy, completion rate, packet-size mix, CPU softirq pressure, whether the queue serves latency-critical flows, and hysteresis to avoid oscillation. It must separate Rx and Tx: Tx-completion moderation delays skb freeing and socket send-buffer progress even when the wire is fine.
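One way to express the hysteresis requirement is a two-profile state machine with separate up/down thresholds and a hold count. The sketch below is illustrative only: the thresholds, the single input signal (average packets per poll), and every name are hypothetical, and a real policy would combine more of the signals listed above.

```c
enum coal_profile { COAL_LOW_LATENCY, COAL_THROUGHPUT };

/* Hypothetical per-queue policy state. */
struct coal_state {
    enum coal_profile profile;
    unsigned int streak;  /* consecutive samples voting to switch */
};

#define PKTS_HIGH    48u  /* hypothetical: switch up at or above this */
#define PKTS_LOW      8u  /* hypothetical: switch down at or below this */
#define HOLD_SAMPLES  4u  /* samples that must agree before switching */

/*
 * Feeds one sample and returns the profile to apply. The gap between
 * PKTS_LOW and PKTS_HIGH plus the hold count keeps a queue hovering
 * near one threshold from flapping between profiles.
 */
enum coal_profile coal_update(struct coal_state *s,
                              unsigned int pkts_per_poll,
                              int latency_critical)
{
    int want_switch;

    if (latency_critical) {          /* pinned queues never batch */
        s->profile = COAL_LOW_LATENCY;
        s->streak = 0;
        return s->profile;
    }

    if (s->profile == COAL_LOW_LATENCY)
        want_switch = pkts_per_poll >= PKTS_HIGH;
    else
        want_switch = pkts_per_poll <= PKTS_LOW;

    if (!want_switch) {
        s->streak = 0;               /* any disagreeing sample resets */
        return s->profile;
    }

    if (++s->streak >= HOLD_SAMPLES) {
        s->profile = (s->profile == COAL_LOW_LATENCY)
                         ? COAL_THROUGHPUT : COAL_LOW_LATENCY;
        s->streak = 0;
    }
    return s->profile;
}
```

A burst of three heavy samples followed by one moderate sample leaves the queue in the low-latency profile; only four heavy samples in a row move it to the throughput profile, and a moderate load of 20 packets per poll then keeps it there.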
Validation needs histograms and causality:
- Hardware timestamping (or calibrated software) at ingress/egress.
- Per-queue IRQ rate, NAPI polls, packets-per-poll, and budget exhaustion.
- p50/p99/p99.9 under incast, elephant flow, and mixed small/large packets.
- CPU package C-state and frequency control: power management can dwarf coalescing changes.
I expose conservative defaults and document that no single setting wins both microburst tail latency and bulk throughput.
- Why can Tx coalescing block application send progress?
- How do `napi_defer_hard_irqs` and `gro_flush_timeout` interact with busy polling?
- How would you prevent adaptive coalescing oscillation?
Explain `page_pool` in the Rx path. Why does it matter for XDP and high packet rate, and what lifecycle bugs would you look for? [senior]
page_pool is a recycling allocator for pages or page fragments used by skb and XDP receive paths. Instead of repeatedly hitting the buddy allocator and dirtying global state, the driver recycles buffers through a fast per-NAPI cache (lockless, drained in poll context) plus a ptr_ring for returns from other CPUs, which preserves cache locality and cuts allocation jitter. It also keeps DMA mapping state (PP_FLAG_DMA_MAP) so the driver avoids full map/unmap cycles on every buffer, and can keep buffers in the device domain so it only needs dma_sync_single_for_cpu() for the portion the CPU reads.
The lifecycle is ownership: the device owns DMA buffers on the Rx ring; the CPU owns them after completion and DMA sync; the stack/XDP may take ownership; the page returns to the pool only when the last reference drops. For XDP, drop and XDP_TX recycle quickly; XDP_REDIRECT and AF_XDP can defer return through another device, CPU, or userspace.
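The ownership transitions can be modelled as a tiny state machine with a reference count. This is a conceptual sketch with hypothetical names (`struct rxbuf`, `rxbuf_*`), not the page_pool API; it only encodes the rule that a buffer returns to the pool when the last reference drops, and that every other transition is legal from exactly one state.

```c
#include <stdbool.h>

enum rxbuf_owner { OWNER_POOL, OWNER_DEVICE, OWNER_CPU };

/* Hypothetical model of one recyclable Rx buffer. */
struct rxbuf {
    enum rxbuf_owner owner;
    unsigned int refs;  /* CPU-side references (skb frags, XDP frames) */
};

/* Pool -> device: the buffer is posted on the Rx ring for DMA. */
bool rxbuf_post(struct rxbuf *b)
{
    if (b->owner != OWNER_POOL)
        return false;   /* posting a live buffer would corrupt it */
    b->owner = OWNER_DEVICE;
    return true;
}

/* Device -> CPU: completion seen and DMA synced; first reference taken. */
bool rxbuf_complete(struct rxbuf *b)
{
    if (b->owner != OWNER_DEVICE)
        return false;
    b->owner = OWNER_CPU;
    b->refs = 1;
    return true;
}

/* Another consumer (e.g. a second frag user) takes a reference. */
bool rxbuf_get(struct rxbuf *b)
{
    if (b->owner != OWNER_CPU || b->refs == 0)
        return false;
    b->refs++;
    return true;
}

/* Drops a reference; returns true only when the buffer was recycled. */
bool rxbuf_put(struct rxbuf *b)
{
    if (b->owner != OWNER_CPU || b->refs == 0)
        return false;
    if (--b->refs == 0) {
        b->owner = OWNER_POOL;
        return true;
    }
    return false;
}
```

In this model the use-after-free bug listed below corresponds to calling `rxbuf_post()` while `refs` is still nonzero, which the state check refuses; real drivers have no such guard, which is why the bug only shows up as corrupted payloads under load.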
Bugs to look for:
- Recycling a page while an skb frag still references it (use-after-free of live data).
- Missing dma_sync_single_for_cpu() on non-coherent paths before the CPU reads headers.
- Returning pages to the wrong pool or wrong NUMA node.
- Fragment accounting mistakes when multiple packets share one page (pp_frag_count/page_pool_fragment_page()).
- Pool starvation when redirect or AF_XDP holds frames too long; page_pool then falls back to slow allocation or drops.
I use page_pool stats (the pp_alloc_*/pp_recycle_* counters that drivers export through ethtool -S), Rx alloc-failure counters, DMA debug, and stress with small packets plus delayed consumers.
- Why are page fragments harder than full-page recycling (refcount vs frag count)?
- How can AF_XDP create backpressure into page_pool?
- What changes on non-coherent architectures with `PP_FLAG_DMA_SYNC_DEV`?
Walk through what happens end to end when a packet is received in threaded NAPI versus classic softirq NAPI. Why was threaded NAPI added, and when does it help or hurt latency? [staff]
In classic NAPI the hardware IRQ schedules NAPI and the poll runs in NET_RX softirq context on the interrupted CPU. Softirqs run on return from IRQ or are pushed to ksoftirqd when a round exceeds its budget. The catch: softirq shares the CPU with whatever else runs there, and ksoftirqd is a normal SCHED_OTHER thread, so under load the receive path can be starved by a busy application on the same core, or conversely can starve the application. You also cannot prioritize it with the scheduler, because softirq context has no schedulable identity per queue.
Threaded NAPI (write 1 to /sys/class/net/DEV/threaded) moves each NAPI instance into its own kernel thread named like napi/eth0-NN. Now receive processing is a schedulable entity: you can set its CPU affinity, its scheduling policy/priority (SCHED_FIFO for isolation), and account it with cgroups. This is the recommended base for the busy-poll/irq-suspend-timeout model and is widely used on RT and AI/HPC hosts.
It helps when: you need deterministic placement, you want to isolate receive from the app or from other softirqs, or you are running PREEMPT_RT where softirqs are already threaded. It can hurt when: the extra context switch and wakeup latency exceed the inline softirq path for a lightly loaded, well-pinned single-flow workload, or when affinity is misconfigured so the thread bounces across NUMA nodes. As with any pinning, the failure mode is putting the NAPI thread and the consuming application on the same core and serializing them.
- How does threaded NAPI change where `ksoftirqd` fits?
- Why does PREEMPT_RT essentially require threaded NAPI?
- What goes wrong if the NAPI thread shares a core with the consumer?
When would you keep traffic in the kernel stack instead of using DPDK, Onload/ef_vi, or AF_XDP? Give a decision framework rather than a slogan. [staff]
Kernel bypass removes generality to reduce overhead and jitter. It can avoid skb allocation, qdisc, socket wakeups, syscalls, and some scheduler effects, but it moves responsibility to the application or userspace stack: polling, memory registration, queue ownership, flow steering, security policy, observability, upgrades, and failure recovery.
I stay in the kernel when I need mature TCP behavior, congestion control, namespaces, iptables/nftables, TC, cgroups, routing, TLS/IPsec integration, standard observability, or coexistence with normal workloads. I consider bypass when the app can own cores and queues, tolerate polling, control memory layout, and benefits from direct descriptors or a userspace TCP stack. Onload is attractive when preserving the BSD sockets API matters (it intercepts libc, so unmodified apps accelerate and transparently fall back to the kernel for unsupported operations); ef_vi or DPDK fit when the application owns packet processing. AF_XDP is a middle ground that keeps Linux as the control plane but still needs careful UMEM/ring management.
Decide on p99.9 latency, CPU-per-packet, recovery after link reset, operational tooling, and how often the app falls back to kernel fds. A bypass path that wins a microbenchmark can lose in production if it breaks epoll mixing, NUMA locality, or incident debugging. Given AMD/Solarflare's lineage, knowing exactly where Onload's API compatibility ends and ef_vi's raw model begins is the substance behind the slogan.
- What production failure modes get harder with bypass?
- How do you handle ARP, routing, and link reset in a bypass design?
- Why can mixing accelerated and unaccelerated fds in one epoll set hurt latency?