Linux NIC Driver Craft
Probes whether a senior driver engineer can build and debug a NIC driver across PCI discovery, DMA ownership, NAPI, queues, locking, reset, and user-visible controls.
Post descriptors, ring the doorbell, watch the NIC DMA and complete (RX/TX).
wmb()), then the posted doorbell write. Completions are learned from the DMA'd OWN/DD bit in host memory, never by polling a NIC register.
Design the probe and remove paths for a PCIe Ethernet driver. What must be initialized before `register_netdev()`, and what ordering mistakes cause real production bugs? (senior)
Probe runs in process context and may sleep. A typical path is: match struct pci_device_id, pci_enable_device(), pci_request_regions(), set the DMA mask with dma_set_mask_and_coherent(), pci_set_master() to enable bus mastering, map BARs with pci_iomap() or managed variants, allocate struct net_device with enough queues via alloc_etherdev_mqs()/alloc_netdev_mqs(), initialize private state, allocate MSI-X vectors with pci_alloc_irq_vectors(), set netdev_ops, ethtool_ops, feature flags, NAPI instances via netif_napi_add(), queue counts (netif_set_real_num_{rx,tx}_queues()), and default ring/coalesce parameters. Only then register_netdev().
The key rule is that after register_netdev() the interface is visible; userspace can open it, change MTU, query ethtool, attach XDP, or start Tx. Anything those callbacks touch must already be initialized. Remove should unregister_netdev() first so the core closes the device and blocks new users, then disable NAPI/IRQs/DMA, free rings/vectors, unmap BARs, release PCI resources, and call free_netdev() last, once nothing can still dereference the netdev or its private data.
Production bugs include an ethtool callback dereferencing uninitialized ring arrays, an open callback racing probe-tail work, DMA continuing after remove, calling free_netdev() too early while sysfs still references the object, and forgetting that hot-unplug/remove also runs in process context but the hardware may no longer respond to MMIO (reads return all-ones).
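The reverse-order unwind that probe needs on failure can be modelled in plain userspace C. This is only a sketch: the stage names and the `model_probe()` helper are hypothetical stand-ins for the real PCI and netdev calls, and the point is the shape of the `goto unwind` ladder, not the calls themselves.

```c
#include <assert.h>

/* Hypothetical probe stages, in the order the answer lists them. */
enum stage { ST_ENABLE, ST_REGIONS, ST_IOMAP, ST_NETDEV, ST_IRQ, ST_REGISTER, ST_COUNT };

/* Records the teardown order so a caller can inspect it. */
struct trace {
    enum stage undone[ST_COUNT];
    int n;
};

static void undo(struct trace *t, enum stage s)
{
    assert(t->n < ST_COUNT);
    t->undone[t->n++] = s;
}

/*
 * Model of a probe() that fails at stage fail_at (ST_COUNT means success).
 * Returns 0 on success, or -1 after undoing every completed stage in
 * reverse order. The failing stage itself is never undone.
 */
int model_probe(enum stage fail_at, struct trace *t)
{
    int s;

    t->n = 0;
    for (s = 0; s < ST_COUNT; s++) {
        if ((enum stage)s == fail_at)
            goto unwind;
    }
    return 0;

unwind:
    while (--s >= 0)
        undo(t, (enum stage)s);
    return -1;
}
```

A failure at `ST_IRQ` undoes netdev, iomap, regions, enable, in that order. Remove follows the same reverse order, with `unregister_netdev()` first because registration was the last successful step.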
- When would you prefer `devm_`/`pcim_` managed resources, and when does managed cleanup ordering bite you?
- Where do you call `pci_set_drvdata()`, and why before enabling IRQs?
- How do SR-IOV VFs change the remove/reset story?
Walk through setting up Rx and Tx descriptor rings using the Linux DMA API. Include coherent memory, streaming mappings, IOMMU effects, and cleanup on partial failure. (staff)
Descriptor rings are usually dma_alloc_coherent() because both CPU and device read/write producer/consumer-visible descriptors. Packet buffers are usually streaming mappings: dma_map_single(), dma_map_page(), or skb-frag mappings for Tx, and page/page_pool mappings for Rx. The device receives a dma_addr_t, not a CPU virtual or physical address. With an IOMMU that address is an IOVA unrelated to physical memory, so a driver must never cast or arithmetically derive it from a page.
Setup order: allocate the software ring, allocate coherent descriptor memory, initialize descriptors CPU-owned, allocate/map Rx buffers, program ring base addresses and lengths into BAR registers, initialize producer/consumer indices, then enable queue DMA. On any failure, unwind in reverse: unmap streaming buffers already mapped, free pages/skbs, free coherent memory with the same size and DMA handle, and leave hardware quiesced.
For Tx, every mapped segment needs an unmap on completion or on error before ownership is forgotten. If dma_mapping_error() fails partway through a multi-frag skb, unmap the segments already mapped and fail without leaking or freeing an skb the stack still owns:
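    /* i must be a signed int so the unwind loop can terminate. */
    for (i = 0; i < nr_frags; i++) {
        len = skb_frag_size(&frags[i]);
        dma = skb_frag_dma_map(dev, &frags[i], 0, len, DMA_TO_DEVICE);
        if (dma_mapping_error(dev, dma))
            goto unwind;
        /* fill descriptor i; keep dma and len for the later unmap */
        desc[i].dma = dma;
        desc[i].len = len;
    }
    return 0;

    unwind:
    /* Unmap only the frags that mapped successfully: i-1 down to 0. */
    while (--i >= 0)
        dma_unmap_page(dev, desc[i].dma, desc[i].len, DMA_TO_DEVICE);
    /*
     * Do not free the skb in this helper. The xmit caller drops it exactly
     * once (dev_kfree_skb_any() and return NETDEV_TX_OK); if it returns
     * NETDEV_TX_BUSY instead, the stack still owns the skb.
     */
    return -ENOMEM;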
Subtle bugs: assuming coherent descriptors remove the need for ordering barriers, missing DMA sync on non-coherent buffer data, overflowing descriptor address fields with 64-bit DMA addresses on limited hardware, and mapping mistakes that surface only when an IOMMU is enabled.
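The address-field overflow is easy to assert against. Below is a minimal userspace sketch of such a check; `dma_range_fits()` is a hypothetical helper, not a kernel API. In a real driver the DMA mask is what prevents this, and a check like this would only serve as a debug assertion when filling descriptors.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true when the whole buffer [addr, addr + len) can be encoded in
 * an addr_bits-wide descriptor address field (1..64 bits).
 */
bool dma_range_fits(uint64_t addr, uint32_t len, unsigned int addr_bits)
{
    uint64_t last, limit;

    assert(addr_bits >= 1 && addr_bits <= 64);
    if (len == 0)
        return false;                   /* nothing to map: treat as invalid */
    last = addr + (len - 1);
    if (last < addr)
        return false;                   /* range wrapped past 2^64 */
    if (addr_bits == 64)
        return true;
    limit = (UINT64_C(1) << addr_bits) - 1;   /* highest encodable address */
    return last <= limit;
}
```

The check covers the last byte of the buffer, not just its start, because a buffer that begins below the limit can still cross it.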
- Why must modern drivers use `dma_map_*()` rather than the deprecated `pci_map_*()`?
- What does `dma_set_mask_and_coherent()` protect you from on a 32-bit-DMA device?
- How would DMA API debug (`CONFIG_DMA_API_DEBUG`) catch a Tx unmap leak?
A NIC occasionally transmits stale packet data after a driver optimization. Explain the memory ordering needed around descriptor writes, ownership bits, and MMIO doorbells. (staff)
The CPU must make descriptor contents visible to the device before it marks the descriptor device-owned or rings the doorbell. On weakly ordered systems, plain stores to coherent memory can be observed out of order by the device. The kernel memory-barrier documentation gives this exact pattern: write descriptor fields, dma_wmb(), publish the ownership/valid bit, then use an MMIO accessor such as writel() to notify the device.
On completion the order reverses: the CPU checks the ownership/status word first, then issues dma_rmb() before reading the remaining descriptor fields, so it cannot act on data it loaded before the device's status write became visible. dma_* barriers order shared memory with the device; they are not a substitute for MMIO accessor ordering or device-specific flush rules.
A compact Tx publish sequence:
    /* Fill all fields while the CPU owns the descriptor. */
    desc->addr = cpu_to_le64(dma);
    desc->len = cpu_to_le16(len);
    desc->flags = flags;
    /* Ensure fields reach memory before ownership is visible to the device. */
    dma_wmb();
    desc->ctrl = cpu_to_le32(DESC_OWN | DESC_EOP);
    /* Notify hardware after publishing descriptors. */
    writel(ring->prod, ring->doorbell);
I would check whether the optimization changed writel() to writel_relaxed() (which drops the ordering against prior normal stores), batched many descriptors but dropped the dma_wmb() before publishing the last ownership bit, wrote a valid bit before the address/length, or reused a DMA buffer before Tx completion confirmed the device was done. A common batching bug: publishing N descriptors with one trailing doorbell but no barrier between the data writes and the head update, so the device reads a descriptor whose address field is still stale.
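The publish/observe discipline has a close userspace analogue in C11 release/acquire ordering, which is handy for reasoning about it outside the kernel. The sketch below is an analogy only: a device is not a CPU thread, `DESC_OWN` and the struct layout are hypothetical, and the release store stands in for `dma_wmb()` plus the ownership write while the acquire load stands in for the ownership check plus `dma_rmb()`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define DESC_OWN 0x1u   /* hypothetical: set means published to the consumer */

struct desc {
    uint64_t addr;
    uint16_t len;
    _Atomic uint32_t ctrl;
};

/* Producer: fill the fields, then publish ownership with release ordering. */
void desc_publish(struct desc *d, uint64_t addr, uint16_t len)
{
    d->addr = addr;
    d->len = len;
    /* Plays the role of dma_wmb() followed by the ownership store. */
    atomic_store_explicit(&d->ctrl, DESC_OWN, memory_order_release);
}

/*
 * Consumer: read the fields only after observing the ownership bit.
 * The acquire load plays the role of the check followed by dma_rmb().
 */
bool desc_consume(struct desc *d, uint64_t *addr, uint16_t *len)
{
    assert(d && addr && len);
    if (!(atomic_load_explicit(&d->ctrl, memory_order_acquire) & DESC_OWN))
        return false;
    *addr = d->addr;
    *len = d->len;
    return true;
}
```

Weakening the release store to a relaxed one is the userspace equivalent of dropping `dma_wmb()`: the consumer may then see the ownership bit with stale `addr` or `len`.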
- When is `writel_relaxed()` safe, and what must you add to keep correctness?
- Why can this pass on x86 (TSO) and fail on Arm/PowerPC?
- How would you prove stale data is an ordering bug rather than buffer reuse?
Implement the structure of a NAPI poll method for a NIC queue. What are the exact completion, budget, and IRQ re-enable gotchas? (senior)
A NAPI poll reclaims Tx completions, processes up to budget Rx packets, and returns Rx work done. If work done is less than budget, it attempts completion with napi_complete_done() and re-enables the queue interrupt only if that returns true. If work equals budget, it returns budget and stays scheduled (it must NOT call complete or unmask).
The critical detail: napi_complete_done() returns false when the core saw a missed event and rescheduled the poll. If you unmask the IRQ anyway, you can take an interrupt for work the still-scheduled poll will also process, or worse race the re-arm. So unmask is strictly gated on the true return.
Other gotchas:
- Tx cleanup should not be capped by the Rx budget, but Rx/XDP/page_pool APIs must not run when budget is 0 (the netpoll/netcons path passes 0).
- The poll must not sleep; it runs in softirq (or a NAPI thread) context.
- Flush GRO through the NAPI APIs, not ad hoc delivery.
- Keep per-queue state cache-local; avoid global locks.
Skeleton:
    static int my_poll(struct napi_struct *napi, int budget)
    {
        struct my_q *q = container_of(napi, struct my_q, napi);
        int work = 0;

        my_clean_tx(q);                 /* not limited by budget */
        if (budget)
            work = my_clean_rx(q, budget);
        if (work < budget) {
            if (napi_complete_done(napi, work))
                my_unmask_irq(q);       /* only when complete succeeded */
        }
        return work;
    }
The IRQ handler should mask or ack the queue interrupt, then schedule via napi_schedule_prep() + __napi_schedule() (or napi_schedule()); if prep says it is already scheduled, leave ownership to the running poll.
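The handshake between the IRQ side and the poll side can be modelled with two state bits. What follows is a simplified userspace sketch, not the kernel implementation: `ST_SCHED` and `ST_MISSED` are hypothetical stand-ins for the core's scheduled and missed flags, and the real code tracks more state (disable, busy-poll, threaded mode).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define ST_SCHED  0x1u  /* a poll is scheduled or running */
#define ST_MISSED 0x2u  /* an event arrived while the poll was running */

struct model_napi {
    atomic_uint state;
};

/* IRQ side: true means the caller won the right to schedule the poll. */
bool model_schedule_prep(struct model_napi *n)
{
    unsigned int old = atomic_load(&n->state);
    unsigned int next;

    do {
        next = old | ST_SCHED;
        if (old & ST_SCHED)
            next |= ST_MISSED;  /* poll already owns the queue: note the event */
    } while (!atomic_compare_exchange_weak(&n->state, &old, next));
    return !(old & ST_SCHED);
}

/* Poll side: true means the poll is complete and the IRQ may be unmasked. */
bool model_complete_done(struct model_napi *n)
{
    unsigned int old = atomic_load(&n->state);
    unsigned int next;

    do {
        /* A missed event keeps SCHED set so the poll runs again. */
        next = (old & ST_MISSED) ? ST_SCHED : 0;
    } while (!atomic_compare_exchange_weak(&n->state, &old, next));
    return !(old & ST_MISSED);
}
```

In this model a false return from `model_complete_done()` leaves the scheduled bit set, which is why unmasking at that point would hand the same work to two owners.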
- What exactly does a false return from `napi_complete_done()` mean?
- Why is `napi_disable()` not idempotent, and what does it wait for?
- How do you handle a queue reset while NAPI is scheduled?
How would you expose ring size and interrupt coalescing through `ethtool_ops`, and what constraints must the driver enforce? (senior)
Implement the modern callbacks (get_ringparam/set_ringparam, get_coalesce/set_coalesce), which the netlink ethtool layer maps to the classic ETHTOOL_GRINGPARAM/ETHTOOL_SRINGPARAM and ETHTOOL_GCOALESCE/ETHTOOL_SCOALESCE semantics. Ring reporting should include max and current Rx/Tx sizes, and optionally buffer length, CQE size, or header/data-split attributes if the hardware supports them. For coalescing, the core checks ethtool_ops.supported_coalesce_params and rejects any field the driver did not advertise, so you declare exactly which usecs/frames/adaptive bits are real, and the core enforces it for you.
Set callbacks must validate hardware limits, alignment, power-of-two constraints, the minimum descriptor count for worst-case fragments, memory pressure, XDP/AF_XDP restrictions, and whether the interface is up. Some changes require closing queues, freeing and reallocating rings, reprogramming hardware, and reopening without losing netdev state. Others apply live per queue.
Reject unsupported attributes rather than silently accepting them, and make changes atomic from the user's view: if resizing Tx succeeds and Rx fails, roll back or leave the device in a documented unchanged state, never half-resized. Stats should distinguish drops due to ring exhaustion, allocation failure, and coalescing-induced delayed service so an operator can tell why packets were lost.
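The all-or-nothing resize can be sketched in userspace with an injectable allocator. Everything here is illustrative: `RING_MIN`, `RING_MAX`, `DESC_SZ`, and the function names are assumptions, `malloc()` stands in for coherent DMA allocation, and a real driver would also quiesce the queues before swapping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical limits; real ones come from the hardware. */
#define RING_MIN 64u
#define RING_MAX 4096u
#define DESC_SZ  16u

struct rings {
    void *tx, *rx;
    uint32_t tx_n, rx_n;
};

/* Accept only power-of-two counts within the hardware limits. */
bool ring_count_ok(uint32_t n)
{
    return n >= RING_MIN && n <= RING_MAX && (n & (n - 1)) == 0;
}

/*
 * Allocate both new rings first and swap only when both succeeded.
 * Returns 0 on success, -1 with *r untouched on any failure.
 */
int rings_resize(struct rings *r, uint32_t tx_n, uint32_t rx_n,
                 void *(*alloc)(size_t))
{
    void *tx, *rx;

    assert(r && alloc);
    if (!ring_count_ok(tx_n) || !ring_count_ok(rx_n))
        return -1;
    tx = alloc((size_t)tx_n * DESC_SZ);
    if (!tx)
        return -1;
    rx = alloc((size_t)rx_n * DESC_SZ);
    if (!rx) {
        free(tx);               /* roll back the half that succeeded */
        return -1;
    }
    free(r->tx);
    free(r->rx);
    r->tx = tx;
    r->rx = rx;
    r->tx_n = tx_n;
    r->rx_n = rx_n;
    return 0;
}

/* Test helper: fails every second call to exercise the rollback path. */
void *alloc_fail_second(size_t sz)
{
    static unsigned int calls;

    return (++calls % 2 == 0) ? NULL : malloc(sz);
}
```

Allocating before freeing costs peak memory but means a failed resize never leaves the device without rings.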
- Can ring resize be allowed while XDP zero-copy (AF_XDP) is bound to a queue?
- What does `supported_coalesce_params` save you from implementing?
- How should adaptive coalescing interact with an explicit user usecs setting?
Choose locking primitives for a NIC driver: Tx fast path, stats, link state, reset, ethtool reconfiguration, and filter tables. Where do `spin_lock_irqsave()`, mutexes, and RCU fit? (staff)
Fast paths in hardirq, softirq, or BH-disabled context need non-sleeping synchronization. Per-Tx-queue locks are often spinlocks (__netif_tx_lock is taken by the core around ndo_start_xmit), sometimes avoided with per-CPU/queue ownership. If one lock is taken from both hardirq and process context, use spin_lock_irqsave() in process context, or design so hardirq never takes it. NAPI poll runs in softirq, so spin_lock_bh() is appropriate against process-context paths that touch the same data.
Mutexes fit slow paths: ethtool ring resize, firmware mailbox commands that sleep, link-mode changes, reset orchestration, probe/remove. Never hold a mutex from NAPI or hardirq. RCU fits read-mostly tables (filter rules, indirection snapshots, netdev private pointers) where fast-path readers need a stable view while updates replace the object and free after a grace period.
Stats must avoid a single global lock. Use per-queue counters and u64_stats_sync for 64-bit counters; aggregate in ndo_get_stats64/ethtool. Reset needs a state bit or refcount protocol so fast paths stop admitting new work, drain NAPI/Tx, and avoid racing ethtool.
The hard part is not picking primitives; it is documenting lock ordering (e.g. rtnl_lock outermost, then driver mutex, then per-queue spinlock) and proving no path sleeps holding a spinlock or enters reset while a queue still owns DMA buffers. lockdep validates the ordering; it cannot prove the DMA-ownership invariant; that is on you.
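One way to express the "stop admitting new work, then drain" protocol is a flag plus an in-flight count. The sketch below uses C11 atomics in userspace; the `gate_*` names are hypothetical, and a kernel driver would more likely use state bits with `test_and_set_bit()` and wait via `napi_disable()` or a completion rather than polling.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct dev_gate {
    atomic_bool resetting;  /* set by the single reset owner */
    atomic_int inflight;    /* fast-path sections currently inside */
};

/*
 * Fast path: count ourselves in first, then check the flag. Doing it in
 * this order means the reset owner either sees our count or we see its flag.
 */
bool gate_enter(struct dev_gate *g)
{
    atomic_fetch_add(&g->inflight, 1);
    if (atomic_load(&g->resetting)) {
        atomic_fetch_sub(&g->inflight, 1);
        return false;
    }
    return true;
}

void gate_exit(struct dev_gate *g)
{
    int prev = atomic_fetch_sub(&g->inflight, 1);

    assert(prev > 0);
    (void)prev;
}

/* Reset owner: returns false if another reset already owns the device. */
bool gate_begin_reset(struct dev_gate *g)
{
    return !atomic_exchange(&g->resetting, true);
}

/* True once no fast-path section remains inside. */
bool gate_drained(struct dev_gate *g)
{
    return atomic_load(&g->inflight) == 0;
}

void gate_end_reset(struct dev_gate *g)
{
    atomic_store(&g->resetting, false);
}
```

The reset owner calls `gate_begin_reset()`, waits until `gate_drained()` holds, does the teardown, then calls `gate_end_reset()`. Slow paths that lose the race can return -EBUSY instead of blocking.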
- When is `spin_lock_bh()` insufficient and you need `irqsave`?
- How do you free an RCU-protected filter object (`call_rcu` vs `synchronize_rcu`)?
- How would lockdep find an rtnl-vs-driver-mutex reset deadlock?
Per-queue 64-bit counters look trivial but break subtly on 32-bit readers. Explain `u64_stats_sync`, when it is a no-op, and the write-side rule that, if violated, hangs a reader forever. (senior)
A 64-bit counter on a 32-bit CPU is stored as two 32-bit words, so a reader can see a torn value mid-increment (low word wrapped, high word not yet bumped). u64_stats_sync wraps the counters in a seqcount-protected critical section so readers retry if a writer was in progress. Crucially, it compiles to (almost) nothing on 64-bit kernels, where aligned 64-bit loads and stores cannot tear and the seqcount machinery is elided (older kernels also elided it on 32-bit UP builds), so it is free where it isn't needed and there is no excuse to skip it.
Writer:
    u64_stats_update_begin(&q->syncp);
    u64_stats_add(&q->rx_packets, 1);
    u64_stats_add(&q->rx_bytes, len);
    u64_stats_update_end(&q->syncp);
Reader (in ndo_get_stats64):
    do {
        start = u64_stats_fetch_begin(&q->syncp);
        pkts = u64_stats_read(&q->rx_packets);
        bytes = u64_stats_read(&q->rx_bytes);
    } while (u64_stats_fetch_retry(&q->syncp, start));
The write-side rule: writers must be mutually exclusive against each other. The seqcount only tolerates one writer at a time; two concurrent update_begins corrupt the sequence and a reader can spin forever waiting for an even count. In a NIC driver this is naturally satisfied because each per-queue counter is written only by that queue's NAPI poll (single context). If you ever update the same syncp from two CPUs (e.g. xmit and completion on different cores touching one counter), you must add your own lock or the seqcount breaks. On a 32-bit reader, that bug presents as a hung ifconfig/ip -s link, not as a wrong number.
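The retry protocol itself is small enough to model in userspace. This is a sketch with C11 atomics, not the kernel's implementation: the `stats_*` names are hypothetical, and the counters are relaxed atomics only to keep the model free of data races in C11 terms, whereas the kernel relies on the retry to discard torn reads.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct q_stats {
    atomic_uint seq;            /* odd while a writer is inside */
    _Atomic uint64_t packets;
    _Atomic uint64_t bytes;
};

void stats_write_begin(struct q_stats *s)
{
    unsigned int v = atomic_load_explicit(&s->seq, memory_order_relaxed);

    assert((v & 1u) == 0);      /* a second writer here breaks the protocol */
    atomic_store_explicit(&s->seq, v + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
}

void stats_write_end(struct q_stats *s)
{
    unsigned int v = atomic_load_explicit(&s->seq, memory_order_relaxed);

    atomic_store_explicit(&s->seq, v + 1, memory_order_release);
}

/* Single-writer update of both counters inside one write section. */
void stats_add(struct q_stats *s, uint64_t len)
{
    stats_write_begin(s);
    atomic_fetch_add_explicit(&s->packets, 1, memory_order_relaxed);
    atomic_fetch_add_explicit(&s->bytes, len, memory_order_relaxed);
    stats_write_end(s);
}

unsigned int stats_read_begin(struct q_stats *s)
{
    return atomic_load_explicit(&s->seq, memory_order_acquire);
}

/* True when the snapshot overlapped a writer and must be retried. */
bool stats_read_retry(struct q_stats *s, unsigned int start)
{
    atomic_thread_fence(memory_order_acquire);
    return (start & 1u) ||
           atomic_load_explicit(&s->seq, memory_order_relaxed) != start;
}

void stats_read(struct q_stats *s, uint64_t *packets, uint64_t *bytes)
{
    unsigned int start;

    do {
        start = stats_read_begin(s);
        *packets = atomic_load_explicit(&s->packets, memory_order_relaxed);
        *bytes = atomic_load_explicit(&s->bytes, memory_order_relaxed);
    } while (stats_read_retry(s, start));
}
```

If a second writer left the sequence permanently odd, the loop in `stats_read()` would never exit, which is the hung-reader symptom described above.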
- Why does this need no protection at all on x86-64?
- What happens to a 32-bit reader if two writers race the same `syncp`?
- Why is `u64_stats_t` preferred over a plain `u64` even with the sync wrapper?
Describe correct netdev queue start/stop/wake behavior for a multiqueue NIC. How do you avoid both `NETDEV_TX_BUSY` and permanent queue stalls? (senior)
Each hardware Tx ring maps to a netdev subqueue. The driver starts queues when the device is open and rings are ready, stops a subqueue before descriptor space becomes insufficient, and wakes it after completions reclaim enough descriptors. The threshold must cover the worst skb the stack can hand down: linear segment plus MAX_SKB_FRAGS, TSO context descriptors, and any hardware metadata descriptors.
ndo_start_xmit() should not normally return NETDEV_TX_BUSY. If the queue was awake, the driver should have guaranteed room or stopped it earlier. After NETDEV_TX_OK the driver owns the skb and must eventually free it in completion reclaim. If it returns BUSY, it must not keep a reference and must not free the skb; the qdisc requeues it.
The stop/wake pair has a lost-wakeup race. The safe sequence is: stop the queue, issue smp_mb(), re-check free descriptors, and wake again if the completion path already drained the ring. The netif_txq_maybe_stop() and netif_subqueue_maybe_stop() helpers in <net/netdev_queues.h> encapsulate it. Permanent stalls also come from over-coalesced Tx-completion interrupts, wrong MSI-X affinity, a reset that clears completions without waking queues, incomplete BQL accounting, or a wake condition that demands too many descriptors. I add per-queue counters for stop, wake, busy, completions, and descriptor availability, then correlate with ethtool -S, qdisc backlog, and socket send-buffer stalls.
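The threshold arithmetic can be checked with a single-threaded model. This sketch deliberately leaves out the barrier and re-check (there is no concurrency to race with), and its constants are assumptions: `MODEL_MAX_FRAGS` uses 17 as a stand-in for `MAX_SKB_FRAGS`, and one extra descriptor is reserved for a TSO context.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_MAX_FRAGS 17u                     /* assumption, see lead-in */
#define DESC_WORST (1u + MODEL_MAX_FRAGS + 1u)  /* head + frags + TSO context */

struct txq {
    uint32_t size;          /* descriptor count, power of two */
    uint32_t prod, cons;    /* free-running indices; difference is in-flight */
    bool stopped;
    uint32_t stops, wakes;  /* the per-queue counters worth exporting */
};

uint32_t txq_free(const struct txq *q)
{
    return q->size - (q->prod - q->cons);
}

/* xmit side: post n descriptors, then stop if a worst-case skb may not fit. */
void txq_xmit(struct txq *q, uint32_t n)
{
    assert(!q->stopped && n <= txq_free(q));
    q->prod += n;
    if (txq_free(q) < DESC_WORST) {
        q->stopped = true;
        q->stops++;
    }
}

/* completion side: reclaim n descriptors, wake once there is room again. */
void txq_complete(struct txq *q, uint32_t n)
{
    assert(n <= q->prod - q->cons);
    q->cons += n;
    if (q->stopped && txq_free(q) >= DESC_WORST) {
        q->stopped = false;
        q->wakes++;
    }
}
```

Because the queue stops as soon as a worst-case skb might not fit, an awake queue always has room and the BUSY return is never needed. Many drivers use a wake threshold above the stop threshold to avoid flapping; the model uses one value for brevity.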
- What happens if a stopped subqueue is never woken after a link flap?
- How does `netif_tx_disable()` differ from stopping one subqueue?
- Where does BQL accounting sit in the enqueue and completion flow?
How do you allocate and place MSI-X interrupts for a multiqueue NIC so each queue's IRQ, NAPI, and consuming app stay NUMA-local? What goes wrong if you get it wrong? (staff)
Allocate vectors with pci_alloc_irq_vectors(pdev, min, max, PCI_IRQ_MSIX), or pci_alloc_irq_vectors_affinity() when you want the core to spread vectors automatically. With PCI_IRQ_AFFINITY the kernel distributes vectors across CPUs (per-CPU if possible, else per-node) and managed IRQs follow CPU hotplug for you, which is generally the right default. When you place affinity manually (e.g. to align Rx queue N with a specific app thread), pick a CPU on the device's local node with cpumask_local_spread() and apply it with irq_set_affinity_and_hint(). The deprecated irq_set_affinity_hint() was split into two explicit calls: irq_set_affinity_and_hint(), which applies the mask, and irq_update_affinity_hint(), which only publishes the hint and does not move the IRQ.
The goal is a clean per-queue lane: the NIC DMAs into buffers on the local node, the MSI-X vector fires on a local CPU, NAPI polls there, and the socket consuming that queue runs there too. Then RSS hashing keeps a flow on one queue and the whole path stays on one node with warm caches.
What goes wrong:
- Vector fires on a remote node, so every packet's descriptor and header read crosses the interconnect; latency and tail variance jump.
- irqbalance re-spreads your IRQs at runtime and undoes manual pinning; you fight it or disable it for these vectors.
- Allocating buffers without dev_to_node(&pdev->dev) awareness puts Rx memory on the wrong node regardless of IRQ placement.
- Using irq_update_affinity_hint() and assuming the IRQ moved (it only publishes a hint).
- More queues than local CPUs forces some queues onto a remote node; you must decide queue count per node deliberately.
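The "local CPUs first, spill only when you run out" policy can be written down as a tiny pure function. This is a userspace sketch in the spirit of `cpumask_local_spread()`; `queue_to_cpu()` and its array-based interface are hypothetical, and the CPU numbers in any caller are made up.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Pick the CPU for queue q: local-node CPUs first, in order, then remote
 * CPUs once queues outnumber local CPUs, wrapping around after that.
 * Returns -1 when no CPU is available.
 */
int queue_to_cpu(unsigned int q, const int *local, size_t n_local,
                 const int *remote, size_t n_remote)
{
    size_t total = n_local + n_remote;
    size_t i;

    assert(n_local == 0 || local);
    assert(n_remote == 0 || remote);
    if (total == 0)
        return -1;
    i = q % total;
    return i < n_local ? local[i] : remote[i - n_local];
}
```

With four local CPUs, queues 0 to 3 stay local and queue 4 is the first to land on a remote node, which makes the cost of asking for a fifth queue visible at configuration time.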
- What does `PCI_IRQ_AFFINITY` automate, and why are managed IRQs nicer for CPU hotplug?
- Why did `irq_set_affinity_hint()` get deprecated?
- How do you keep `irqbalance` from undoing your placement?
A production host reports PCIe AER errors and the NIC stops passing traffic. Design an error/reset recovery path that does not corrupt memory or deadlock the network stack. (staff)
The recovery path is a state machine: mark the device unhealthy, stop admitting new Tx, carrier-off/detach as appropriate, disable interrupts, napi_disable()/synchronize NAPI, stop DMA engines, wait for or abort in-flight work, reclaim/unmap buffers exactly once, reset hardware or participate in PCI error recovery via struct pci_error_handlers (error_detected -> slot_reset -> resume), reinitialize rings, restore filters/RSS/MSI-X/coalescing, re-enable DMA/interrupts/NAPI, wake queues, and set carrier from link state.
Memory safety first. If hardware may still DMA, do not free or reuse buffers until DMA is quiesced or the PCI function is isolated by reset/IOMMU. If descriptors are lost across reset, software must walk its own rings and unmap/free outstanding skbs/pages. Tx completions arriving after reset must be ignored or matched against a generation counter, or you double-free.
Deadlocks come from calling reset while holding a Tx lock that NAPI cleanup needs, sleeping in atomic context, or taking rtnl/device locks in the wrong order. Have one reset owner, keep unambiguous state bits for open/closing/resetting/dead, and make ethtool, MTU change, and XDP attach either serialize with reset or return -EBUSY. Note the PCI error callbacks themselves run in process context and may sleep, but while the channel is frozen MMIO reads during error_detected return all-ones; do not trust register reads until slot_reset.
I test with AER injection, FLR, surprise-remove, link flap, firmware timeout, and traffic under load while running lockdep, KASAN, DMA API debug, and queue-stall assertions.
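The generation-counter defence against late completions is worth seeing in miniature. The sketch below is a userspace model under stated assumptions: `struct cqe` carrying a generation tag is hypothetical (real hardware may only echo a slot index, in which case the tag lives in software state), and `malloc()` stands in for skb ownership.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define RING_N 8u

struct tx_ring {
    void *buf[RING_N];      /* stand-in for outstanding skb pointers */
    uint16_t gen;           /* bumped on every reset */
    unsigned int freed;     /* buffers released so far */
};

/* A completion names the slot and the generation it was posted under. */
struct cqe {
    uint16_t slot, gen;
};

static void release_slot(struct tx_ring *r, uint16_t slot)
{
    free(r->buf[slot]);
    r->buf[slot] = NULL;
    r->freed++;
}

/* Post a buffer; the returned tag is what a completion would echo back. */
struct cqe tx_post(struct tx_ring *r, uint16_t slot)
{
    struct cqe tag = { slot, r->gen };

    assert(slot < RING_N && r->buf[slot] == NULL);
    r->buf[slot] = malloc(64);
    assert(r->buf[slot] != NULL);   /* model only: no OOM handling */
    return tag;
}

/* Returns true if the completion was accepted, false if stale or duplicate. */
bool tx_complete(struct tx_ring *r, struct cqe c)
{
    if (c.slot >= RING_N || c.gen != r->gen || r->buf[c.slot] == NULL)
        return false;
    release_slot(r, c.slot);
    return true;
}

/* Reset: walk the software ring, free what is outstanding, bump gen. */
void tx_reset(struct tx_ring *r)
{
    for (uint16_t i = 0; i < RING_N; i++)
        if (r->buf[i])
            release_slot(r, i);
    r->gen++;
}
```

After `tx_reset()`, a completion tagged with the old generation is rejected even if its slot has been reposted, so each buffer is freed exactly once.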
- What is different for Function Level Reset versus full device/slot reset?
- How do you handle firmware mailbox commands that time out during reset?
- What state must be replayed after `slot_reset` returns?
How would you debug a rare Rx data corruption that appears only with IOMMU enabled, jumbo frames, and XDP redirect? (staff)
I assume an ownership or mapping bug until proven otherwise. The IOMMU changes DMA addresses (IOVAs) and traps some out-of-bounds accesses; jumbo frames stress multi-buffer/fragment accounting; XDP redirect changes completion timing and can keep frames alive after the original Rx queue wants to recycle them.
First isolate dimensions: disable XDP redirect, drop MTU to standard, toggle the IOMMU only as a comparison, switch the XDP action to XDP_PASS/XDP_DROP, and check whether corruption follows a queue, NUMA node, or page_pool instance. Then instrument the buffer lifecycle using unused descriptor/metadata bytes: stamp a generation counter, ring index, DMA address, page pointer, refcount, and last owner. Enable DMA API debug, page_owner/page poisoning where feasible, KASAN/KFENCE on a repro kernel, and tracepoints for map/sync/recycle/redirect-completion.
Likely bugs:
- Reusing a page fragment before XDP_REDIRECT completion returns it to the pool.
- Wrong DMA length for jumbo or multi-buffer packets (mapping head length, not full frame).
- Missing dma_sync_single_for_cpu() before the CPU reads headers on a non-coherent path.
- A descriptor type that still carries a physical address instead of the IOVA.
- Ring-wrap or descriptor-stride errors exposed only by larger CQEs.
- An XDP program touching frag bytes directly instead of bpf_xdp_load_bytes() on a multi-buffer frame.
The fix must include an assertion or counter proving the bad ownership transition cannot recur, not just a timing delay that hides it.
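One shape such an assertion can take is an explicit ownership state machine with a counter for illegal moves. This is a userspace sketch, not a page_pool API: the owner states, `rx_buf_move()`, and the counter are all hypothetical, and a real driver would keep the state in per-buffer metadata and export the counter through its stats.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum owner { OWN_POOL, OWN_DEVICE, OWN_CPU, OWN_REDIRECT, OWN_COUNT };

struct rx_buf {
    enum owner owner;
    uint32_t gen;       /* bumped each time the buffer returns to the pool */
};

static uint64_t bad_transitions;    /* would be a driver stat in practice */

/* Legal moves: pool->device->cpu->{pool, redirect}, redirect->pool. */
static bool transition_ok(enum owner from, enum owner to)
{
    switch (from) {
    case OWN_POOL:     return to == OWN_DEVICE;
    case OWN_DEVICE:   return to == OWN_CPU;
    case OWN_CPU:      return to == OWN_POOL || to == OWN_REDIRECT;
    case OWN_REDIRECT: return to == OWN_POOL;
    default:           return false;
    }
}

/* Returns false and counts the event instead of silently changing owner. */
bool rx_buf_move(struct rx_buf *b, enum owner to)
{
    assert(b && to < OWN_COUNT);
    if (!transition_ok(b->owner, to)) {
        bad_transitions++;
        return false;
    }
    if (to == OWN_POOL)
        b->gen++;
    b->owner = to;
    return true;
}

uint64_t rx_bad_transitions(void)
{
    return bad_transitions;
}
```

Reposting a buffer to the device while it is still held by a redirect is rejected and counted, so the bug shows up as a nonzero counter rather than as corrupted payload some time later.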
- What would you log without perturbing the race timing?
- How can page_pool recycle/alloc stats localize this?
- Why can adding a `printk` make the bug disappear (Heisenbug)?