🧩 Linux NIC Driver Lifecycle
First-screen questions at senior level on the Linux PCI Ethernet driver lifecycle. The answers are framed honestly: the material is studied and traced, not shipped, and each answer bridges to real experience in embedded C, firmware, registers, interrupts, and HW/SW debugging.
Walk me through what a Linux PCI NIC driver does in probe().
I would frame this honestly: I have studied and traced this path, but I have not shipped a Linux Ethernet NIC driver. My mental model is that probe() is where the PCI core has matched vendor/device IDs, and the driver takes ownership of the device enough to make it visible as a network interface.
The usual sequence I would expect is:
- Enable the PCI function with pci_enable_device_mem() or pci_enable_device() and check failure.
- Request the BAR resources, usually with managed helpers like pcim_iomap_regions() or explicit pci_request_regions() plus pci_iomap().
- Set DMA capability with dma_set_mask_and_coherent() or PCI wrappers, because the device will DMA descriptors and packet buffers.
- Allocate and initialize struct net_device, attach private state with netdev_priv(), and set netdev_ops and ethtool_ops.
- Allocate queue/ring state, map MMIO registers, set up MSI-X vectors, and bind per-queue IRQ/NAPI structures.
- Register the interface with register_netdev() only after the object is ready enough for user space to see it.
    static int mynic_probe(struct pci_dev *pdev, const struct pci_device_id *id)
    {
        struct net_device *ndev;
        struct mynic *priv;
        int err;

        /* Managed enable: undone automatically on probe failure or remove. */
        err = pcim_enable_device(pdev);
        if (err)
            return err;

        /* Claim and map BAR 0; the mask selects which BARs to request. */
        err = pcim_iomap_regions(pdev, BIT(0), "mynic");
        if (err)
            return err;

        err = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));
        if (err)
            return err;

        /* Without bus mastering the device cannot issue DMA at all. */
        pci_set_master(pdev);

        ndev = alloc_etherdev_mq(sizeof(*priv), MYNIC_TXQS);
        if (!ndev)
            return -ENOMEM;

        SET_NETDEV_DEV(ndev, &pdev->dev);
        priv = netdev_priv(ndev);
        priv->pdev = pdev;
        priv->bar0 = pcim_iomap_table(pdev)[0];
        ndev->netdev_ops = &mynic_netdev_ops;
        ndev->ethtool_ops = &mynic_ethtool_ops;
        pci_set_drvdata(pdev, ndev);

        /* Last step: after this, user space can open the interface. */
        err = register_netdev(ndev);
        if (err)
            free_netdev(ndev);
        return err;
    }
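When the managed pcim_/devm_ helpers are not used, every resource acquired in probe() needs an explicit unwind if a later step fails. To check my own understanding of that, here is a minimal user-space sketch of the goto-unwind idiom. The step names, the fail_at knob, and the probe_log structure are hypothetical and exist only to make the unwind order observable; none of them are kernel APIs.

```c
#include <stddef.h>

/* Hypothetical stand-ins for probe steps; not real kernel calls. */
enum step { ENABLE, REGIONS, ALLOC_NDEV, REGISTER, NSTEPS };

struct probe_log {
    int undone[NSTEPS];   /* steps undone, in the order they were undone */
    size_t n_undone;
};

/* A step "fails" when its id matches fail_at; pass -1 for no failure. */
static int do_step(enum step s, int fail_at)
{
    return (int)s == fail_at ? -1 : 0;
}

static void undo_step(struct probe_log *log, enum step s)
{
    log->undone[log->n_undone++] = (int)s;
}

/* Returns 0 on success; on failure, unwinds only what was acquired. */
static int model_probe(struct probe_log *log, int fail_at)
{
    int err;

    log->n_undone = 0;

    err = do_step(ENABLE, fail_at);
    if (err)
        return err;
    err = do_step(REGIONS, fail_at);
    if (err)
        goto err_disable;
    err = do_step(ALLOC_NDEV, fail_at);
    if (err)
        goto err_release;
    err = do_step(REGISTER, fail_at);
    if (err)
        goto err_free;
    return 0;

err_free:
    undo_step(log, ALLOC_NDEV);
err_release:
    undo_step(log, REGIONS);
err_disable:
    undo_step(log, ENABLE);
    return err;
}
```

The labels fall through, so a failure at any step undoes exactly the earlier steps, newest first. That is the property I would look for when reviewing a real probe() error path.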
The bridge to my real background is the bring-up discipline. In MediaTek firmware work I was not registering Linux netdevs, but I was repeatedly doing the same class of work: read the spec, map registers to C structures or accessors, initialize hardware blocks in the right order, reason about ownership between MCU/DSP/host, and debug failures where one missed bit or ordering assumption breaks the whole path.
- Why is `register_netdev()` near the end rather than the beginning?
- What is the difference between PCI config space and BAR MMIO space?
- What cleanup is needed if one middle step fails?
What happens in `ndo_open()` and `ndo_stop()`?
I understand probe() as device discovery and object registration, while ndo_open() is the operational bring-up when the interface is administratively brought up, for example by ip link set dev ethX up. ndo_stop() is the symmetric path when the interface goes down.
In ndo_open() I would expect the driver to allocate or initialize runtime rings if not already done, fill RX descriptors with DMA-mapped buffers, and request or enable IRQs and NAPI. It then programs device registers for queue base addresses and interrupt moderation, starts the PHY or link management path, and enables the DMA engines. The last step is to start the TX queues with netif_tx_start_all_queues() on a multi-queue device, or netif_start_queue() on a single-queue one.
In ndo_stop() I would expect the reverse: stop the stack from submitting TX with netif_tx_disable(), disable interrupts, disable NAPI, stop DMA engines, drain or clean completions, unmap/free RX and TX buffers, and put the link/MAC into a quiet state.
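    static int mynic_open(struct net_device *ndev)
    {
        struct mynic *priv = netdev_priv(ndev);

        /* Post RX buffers before the hardware is allowed to DMA. */
        mynic_refill_rx(priv);
        /* One NAPI instance shown; a real driver loops over its queues. */
        napi_enable(&priv->q[0].napi);
        mynic_enable_irqs(priv);
        mynic_start_dma(priv);
        /* The device was allocated multi-queue, so start every TX queue. */
        netif_tx_start_all_queues(ndev);
        return 0;
    }

    static int mynic_stop(struct net_device *ndev)
    {
        struct mynic *priv = netdev_priv(ndev);

        /* Stop new TX first so nothing lands on rings being torn down. */
        netif_tx_disable(ndev);
        mynic_disable_irqs(priv);
        napi_disable(&priv->q[0].napi);
        /* DMA must be idle before the buffers it targets are freed. */
        mynic_stop_dma(priv);
        mynic_free_rings(priv);
        return 0;
    }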
I have not shipped this Linux path, but the operational pattern is familiar from embedded C: bring a hardware pipeline from reset to running state, then shut it down in an order that prevents the hardware from touching stale memory or raising interrupts after software state has gone away.
- Why should TX queues be stopped before tearing down rings?
- Where would NAPI be enabled and disabled?
- Which operations belong in `probe()` versus `ndo_open()`?
Explain the TX path from the kernel stack to the NIC.
At first-screen depth, I describe TX as the stack handing the driver an skb through ndo_start_xmit(). The driver chooses a TX ring, checks descriptor space, maps packet data for DMA, writes descriptors the NIC understands, updates the producer index, and rings a doorbell MMIO register so the device knows work is available.
If the ring is full, the driver stops that netdev queue with netif_stop_subqueue() or similar and returns NETDEV_TX_BUSY only in cases where it did not consume the skb. If it consumes the skb, ownership transfers: later the TX completion path unmaps DMA, frees the skb, updates stats/BQL, and wakes the queue if descriptors are available again.
    static netdev_tx_t mynic_start_xmit(struct sk_buff *skb, struct net_device *ndev)
    {
        struct mynic *priv = netdev_priv(ndev);
        struct mynic_txq *txq = &priv->txq[skb_get_queue_mapping(skb)];
        dma_addr_t dma;

        /* Ring full: skb not consumed, so the stack keeps ownership. */
        if (!mynic_tx_has_space(txq)) {
            netif_stop_subqueue(ndev, txq->idx);
            return NETDEV_TX_BUSY;
        }

        /* Sketch maps only the linear head; paged frags are omitted. */
        dma = dma_map_single(&priv->pdev->dev, skb->data, skb_headlen(skb), DMA_TO_DEVICE);
        if (dma_mapping_error(&priv->pdev->dev, dma)) {
            /* Drop: the skb is consumed, so report OK and count it. */
            ndev->stats.tx_dropped++;
            dev_kfree_skb_any(skb);
            return NETDEV_TX_OK;
        }

        mynic_fill_tx_desc(txq, dma, skb_headlen(skb), skb);
        /* Descriptor writes must be visible before the device is told. */
        dma_wmb();
        writel(txq->prod, priv->bar0 + MYNIC_TX_DOORBELL(txq->idx));
        return NETDEV_TX_OK;
    }
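The stop/wake and ownership rules are easier for me to reason about with the skb and DMA details stripped away. The following is a user-space sketch of a TX ring with free-running producer and consumer indexes. The names (tx_ring, ring_xmit, ring_complete) and the four-slot size are assumptions for illustration, and the stopped flag only models what netif_stop_subqueue() and the matching wake call would do.

```c
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 4u   /* power of two so index masking works */

struct tx_ring {
    uint32_t prod;         /* next slot software will fill */
    uint32_t cons;         /* next slot the completion path will release */
    bool stopped;          /* models the stopped state of the netdev queue */
    int owner[RING_SIZE];  /* packet id held by each slot, 0 = empty */
};

static uint32_t ring_free(const struct tx_ring *r)
{
    return RING_SIZE - (r->prod - r->cons);
}

/* Returns true if the packet was consumed (ownership moved to the ring). */
static bool ring_xmit(struct tx_ring *r, int pkt_id)
{
    if (ring_free(r) == 0) {
        r->stopped = true;     /* caller keeps the packet: the "busy" case */
        return false;
    }
    r->owner[r->prod & (RING_SIZE - 1)] = pkt_id;
    r->prod++;
    if (ring_free(r) == 0)
        r->stopped = true;     /* stop early so the stack does not retry */
    return true;
}

/* Completion path: release up to 'done' slots, then wake if space exists. */
static int ring_complete(struct tx_ring *r, uint32_t done)
{
    int freed = 0;

    while (done-- && r->cons != r->prod) {
        r->owner[r->cons & (RING_SIZE - 1)] = 0;  /* "unmap and free" */
        r->cons++;
        freed++;
    }
    if (r->stopped && ring_free(r) > 0)
        r->stopped = false;    /* models waking the subqueue */
    return freed;
}
```

Stopping the queue as soon as the last slot is taken means the busy return becomes the rare case rather than the normal back-pressure path, which is the behavior I would expect a well-tuned driver to aim for.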
My honest bridge is that I have written embedded C around TX DSP firmware and hardware-facing queues, not Linux skb code. The reasoning transfers at the descriptor and ownership level: who owns the buffer, what register tells hardware to start, what memory ordering is required, and how completion returns resources.
- When does the driver free the skb?
- Why is a DMA barrier needed before the doorbell?
- What happens if the TX ring is full?
Explain RX with NAPI and the poll budget.
RX is usually interrupt plus polling. The NIC raises an interrupt for a queue. The IRQ handler does minimal work: acknowledge or mask the queue interrupt and schedule NAPI. Then the NAPI poll method drains RX completions up to budget, builds or attaches packet data to skbs or pages, passes packets upward with napi_gro_receive() or related APIs, refills RX descriptors, and either completes NAPI or asks to be called again.
The budget is important. TX completions may be cleaned without consuming RX budget, but RX packet processing must not exceed the budget. If work remains, the poller returns exactly budget and stays scheduled. If it processed fewer than budget packets, it calls napi_complete_done() and unmasks the queue interrupt only when that call returns true.
    static int mynic_poll(struct napi_struct *napi, int budget)
    {
        struct mynic_q *q = container_of(napi, struct mynic_q, napi);
        int work = 0;

        /* TX cleanup does not count against the RX budget. */
        mynic_clean_tx(q);

        while (work < budget && mynic_rx_complete(q)) {
            struct sk_buff *skb = mynic_build_skb(q);

            /* On allocation failure the packet is dropped, not retried. */
            if (skb)
                napi_gro_receive(napi, skb);
            mynic_refill_one_rx(q);
            work++;
        }

        /* Re-arm the IRQ only if NAPI really completed. */
        if (work < budget && napi_complete_done(napi, work))
            mynic_unmask_queue_irq(q);
        return work;
    }
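To convince myself of the budget contract, I find it useful to model it without any kernel types. This is a user-space sketch under stated assumptions: rxq_model, model_irq, and model_poll are hypothetical names, pending stands in for completed RX descriptors, and the scheduled flag stands in for the NAPI instance being on the poll list.

```c
#include <stdbool.h>

struct rxq_model {
    int pending;        /* completed RX descriptors waiting for software */
    bool irq_masked;    /* true while polling owns the queue */
    bool scheduled;     /* models NAPI being on the poll list */
};

/* IRQ handler: mask the source and hand off to the poller. */
static void model_irq(struct rxq_model *q)
{
    q->irq_masked = true;
    q->scheduled = true;
}

/* Poll: process at most 'budget' packets and return the work done. */
static int model_poll(struct rxq_model *q, int budget)
{
    int work = 0;

    while (work < budget && q->pending > 0) {
        q->pending--;           /* "build skb, pass up, refill" */
        work++;
    }
    if (work < budget) {        /* drained: complete and unmask */
        q->scheduled = false;
        q->irq_masked = false;
    }
    return work;                /* work == budget means "poll me again" */
}
```

With 150 packets pending and a budget of 64, the poller runs three times (64, 64, 22) and only the last call unmasks the interrupt. If the backlog is an exact multiple of the budget, one extra poll returns zero and completes, which matches my reading of how real drivers behave.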
I have not shipped NAPI code, but I have debugged embedded interrupt and firmware pipelines where the same principle matters: do the smallest safe work in interrupt context, batch the heavier processing, and maintain a strict ownership contract for buffers shared with hardware.
- Why not process all RX packets directly in the IRQ handler?
- What should the driver return if it used the whole budget?
- What does RX refill mean?
How would you describe a NIC reset or recovery path?
I would describe reset as a controlled teardown and rebuild of the datapath without pretending I have shipped one. The priority is to stop new work, make hardware stop touching memory, rebuild known-good state, and restart without leaking buffers or leaving the stack with false link/queue state.
The rough sequence is:
- Quiesce software entry points: stop TX queues and mark the device resetting.
- Disable or mask interrupts and disable NAPI so polling does not race the reset.
- Stop DMA engines and wait for hardware idle if the device provides a bit for that.
- Clean or invalidate rings, unmap/free stale buffers as appropriate.
- Reset the MAC/NIC function or relevant hardware block.
- Reprogram registers, queue base addresses, interrupt moderation, RSS, offloads, and link state.
- Refill RX rings, re-enable DMA, NAPI, interrupts, and TX queues.
The part I would be careful about is not assuming reset is just writing a reset bit. It is a state machine across Linux queues, DMA memory, MMIO registers, interrupts, and link reporting.
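The sequence above can be sketched as a small user-space state model. It is only a way to make the ordering explicit, not a real driver: nic_model and the model_* functions are hypothetical, and the boolean fields stand in for the Linux queue, NAPI, IRQ, and DMA state that a real reset path would manipulate.

```c
#include <stdbool.h>

enum dev_state { DEV_DOWN, DEV_RUNNING, DEV_RESETTING };

struct nic_model {
    enum dev_state state;
    bool tx_enabled, napi_enabled, irq_enabled, dma_running;
    int rx_posted;      /* RX buffers currently owned by hardware */
};

/* The transmit entry point refuses work unless the device is running. */
static bool model_xmit_allowed(const struct nic_model *n)
{
    return n->state == DEV_RUNNING && n->tx_enabled;
}

/* Teardown half: software entry points first, buffers last. */
static void model_quiesce(struct nic_model *n)
{
    n->state = DEV_RESETTING;
    n->tx_enabled = false;
    n->irq_enabled = false;
    n->napi_enabled = false;
    n->dma_running = false;   /* only after this is it safe to free */
    n->rx_posted = 0;
}

/* Rebuild half: mirror image of the teardown order. */
static void model_restart(struct nic_model *n, int rx_ring_size)
{
    n->rx_posted = rx_ring_size;   /* refill before DMA can run */
    n->dma_running = true;
    n->napi_enabled = true;
    n->irq_enabled = true;
    n->tx_enabled = true;
    n->state = DEV_RUNNING;
}

static void model_reset(struct nic_model *n, int rx_ring_size)
{
    model_quiesce(n);
    /* a real driver would reset the hardware block and reprogram it here */
    model_restart(n, rx_ring_size);
}
```

The point of the explicit resetting state is that the transmit path has one cheap check that is false for the entire window in which rings are invalid.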
This maps well to my real experience with MCU-DSP and TX firmware bugs. I have had to reason from symptoms through hardware state, firmware sequencing, and shared-memory/register contracts. I would be honest that the Linux NIC APIs are newer to me, but the recovery thinking is close to work I have actually done.
- What can go wrong if DMA is not stopped before freeing buffers?
- How would you prevent `ndo_start_xmit()` racing reset?
- What state should be restored after reset?
What is the remove or shutdown cleanup path?
Remove is the permanent detach path. Shutdown is similar in spirit but often targeted at system poweroff or reboot behavior. The driver must assume user space and the networking stack may still know the interface until it unregisters it, so ordering matters.
The high-level cleanup I would expect is: unregister the netdev (unregister_netdev() itself closes the interface through ndo_stop() if it is still up), disable interrupts/NAPI, stop DMA, free IRQ/MSI-X vectors, unmap and free DMA rings and packet buffers, release MMIO mappings/BAR regions, clear driver data, free the netdev, and disable the PCI device if using unmanaged APIs.
The Linux PCI documentation explicitly calls out disabling IRQ generation, freeing IRQs, stopping DMA, releasing DMA buffers, unregistering from subsystems like netdev, releasing MMIO/IO resources, and disabling the device. Managed devm_ or pcim_ helpers can reduce manual cleanup, but I would still reason about the conceptual order.
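My mental model of the managed helpers is a stack of undo actions that runs newest-first when the device goes away. The sketch below is a user-space illustration of that idea only. cleanup_stack, cleanup_push, cleanup_run, and the trace helper are hypothetical names, and this is not how the kernel's devres code is actually implemented.

```c
#include <stddef.h>

#define MAX_ACTIONS 8

typedef void (*cleanup_fn)(void *ctx);

struct cleanup_stack {
    cleanup_fn fn[MAX_ACTIONS];
    void *ctx[MAX_ACTIONS];
    size_t depth;
};

/* Register an undo action right after the matching acquire succeeds. */
static int cleanup_push(struct cleanup_stack *s, cleanup_fn fn, void *ctx)
{
    if (s->depth == MAX_ACTIONS)
        return -1;
    s->fn[s->depth] = fn;
    s->ctx[s->depth] = ctx;
    s->depth++;
    return 0;
}

/* Run every action newest-first, so teardown mirrors setup. */
static void cleanup_run(struct cleanup_stack *s)
{
    while (s->depth > 0) {
        s->depth--;
        s->fn[s->depth](s->ctx[s->depth]);
    }
}

/* Demo action: append this step's tag to a shared trace buffer. */
struct trace_ctx {
    char *buf;
    size_t *len;
    char tag;
};

static void trace_action(void *ctx)
{
    struct trace_ctx *t = ctx;

    t->buf[(*t->len)++] = t->tag;
}
```

If setup pushes BAR, then DMA rings, then IRQ, the run order is IRQ, DMA rings, BAR. That is the conceptual order I would still want to verify by hand even when helpers hide it.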
My embedded-C bridge is teardown hygiene. In firmware, the equivalent mistakes are leaving an interrupt enabled after its state is freed, or letting a DSP/MCU continue writing to a buffer that software has repurposed. I have debugged that class of ownership and lifetime issue, even though the Linux netdev cleanup APIs are not something I have shipped.
- What is the difference between `unregister_netdev()` and `free_netdev()`?
- Why stop DMA before releasing coherent memory?
- How do managed PCI helpers change the cleanup code?
Where do `ethtool_ops`, stats, and link settings fit?
ethtool_ops is the driver-facing hook for user-visible NIC controls and reporting. I would place it alongside netdev_ops during netdev setup, not as part of the hot TX/RX datapath. It exposes things like driver info, link modes, pause/FEC depending on device support, ring sizes, channels, coalescing, RSS, offloads, timestamping, and private or standard stats.
For stats, I would separate fast-path counters from user query formatting. The datapath updates per-queue counters carefully, often per-CPU or protected to avoid contention. ethtool -S or rtnetlink stats read and aggregate those counters into a stable report.
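The split between fast-path updates and query-time aggregation can be shown in a few lines of user-space C. This is a sketch with hypothetical names (q_stats, q_stats_rx, stats_sum). A real driver would also need protection against torn 64-bit reads on 32-bit machines, which I understand is what the kernel's u64_stats_sync helpers are for, and that is deliberately left out here.

```c
#include <stddef.h>
#include <stdint.h>

struct q_stats {
    uint64_t packets;
    uint64_t bytes;
    uint64_t drops;
};

/* Fast path: each queue touches only its own counters. */
static void q_stats_rx(struct q_stats *s, uint32_t len)
{
    s->packets++;
    s->bytes += len;
}

/* Query path: fold per-queue counters into one report. */
static struct q_stats stats_sum(const struct q_stats *q, size_t nq)
{
    struct q_stats total = {0};

    for (size_t i = 0; i < nq; i++) {
        total.packets += q[i].packets;
        total.bytes += q[i].bytes;
        total.drops += q[i].drops;
    }
    return total;
}
```

Keeping counters per queue means the hot path never shares a cache line across CPUs for statistics, and the cost of summing is paid only when someone asks.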
For link settings, modern drivers may interact with phylink, PHY, firmware, or MAC registers depending on the hardware. The driver should report what the NIC actually supports and what the link partner negotiated, not just hard-code a speed.
The honest bridge is that I have worked deeply with wireless PHY configuration and counters, not Ethernet PHY management. I can reason about capability reporting, negotiated hardware state, and register-backed stats from my 3GPP-to-C and TX DSP firmware work, but I would not claim I have owned Linux ethtool support in production.
- What is the difference between netdev stats and `ethtool -S` stats?
- Why should ethtool settings reflect hardware capability?
- What kinds of bugs appear in stats code?
Explain BARs, MMIO, and PCI config space in a NIC driver.
PCI config space is the standardized control/discovery space for the PCI function: vendor/device IDs, command/status, BAR registers, capabilities like MSI-X or PCIe capabilities, and power/error features. The driver should use PCI core helpers rather than directly inventing addresses from config space.
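To make the capability list concrete, here is a user-space sketch that walks it in a 256-byte snapshot of config space, such as one dumped from sysfs. The offsets and IDs match the values I know from the PCI register layout (status at 0x06, capabilities pointer at 0x34, MSI-X capability ID 0x11), but find_cap is a hypothetical helper. In a driver I would call pci_find_capability() rather than parse this by hand.

```c
#include <stdint.h>

#define PCI_STATUS            0x06
#define PCI_STATUS_CAP_LIST   0x10   /* bit 4: capability list present */
#define PCI_CAPABILITY_LIST   0x34   /* offset of the first capability */
#define PCI_CAP_ID_MSIX       0x11

/* Walk the capability linked list in a 256-byte config-space snapshot.
 * Returns the offset of the capability, or 0 if it is absent. */
static uint8_t find_cap(const uint8_t cfg[256], uint8_t cap_id)
{
    uint8_t pos;
    int ttl = 48;   /* guard against a corrupted, looping list */

    if (!(cfg[PCI_STATUS] & PCI_STATUS_CAP_LIST))
        return 0;
    pos = cfg[PCI_CAPABILITY_LIST] & 0xFC;   /* low two bits are reserved */
    while (pos >= 0x40 && ttl--) {
        if (cfg[pos] == cap_id)              /* byte 0: capability ID */
            return pos;
        pos = cfg[pos + 1] & 0xFC;           /* byte 1: next pointer */
    }
    return 0;
}
```

Each capability is a small header (ID byte, next-pointer byte) followed by capability-specific registers, so finding MSI-X is just following pointers until the ID matches or the list ends.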
BARs are Base Address Registers. They describe device address windows assigned by firmware/kernel. A NIC BAR commonly exposes MMIO registers: queue doorbells, interrupt masks, status registers, reset controls, admin queues, and device-specific blocks. The driver requests the BAR region and maps it into kernel virtual address space, then uses readl() and writel() style accessors.
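The raw BAR value encodes more than an address, and decoding it by hand helped me read lspci output. The sketch below is user-space C with hypothetical names (bar_info, decode_bar). The bit layout is the standard one as I understand it: bit 0 selects I/O versus memory, bits 2:1 give the memory type, and bit 3 marks prefetchable memory. A driver would use pci_resource_start() and related helpers instead.

```c
#include <stdbool.h>
#include <stdint.h>

struct bar_info {
    bool is_io;
    bool is_64bit;
    bool prefetchable;
    uint64_t base;
};

/* Decode one BAR from raw config-space dwords; 'hi' is the next BAR
 * register and is only used for a 64-bit memory BAR. */
static struct bar_info decode_bar(uint32_t lo, uint32_t hi)
{
    struct bar_info b = {0};

    if (lo & 0x1) {                          /* bit 0 set: I/O space */
        b.is_io = true;
        b.base = lo & ~0x3u;
        return b;
    }
    b.is_64bit = ((lo >> 1) & 0x3) == 0x2;   /* type field 10b = 64-bit */
    b.prefetchable = (lo & 0x8) != 0;
    b.base = lo & ~0xFu;                     /* low four bits are flags */
    if (b.is_64bit)
        b.base |= (uint64_t)hi << 32;
    return b;
}
```

For example, a low dword of 0xFE00000C with a high dword of 1 decodes as a 64-bit prefetchable memory BAR at 0x1FE000000.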
MMIO is not normal RAM. Reads and writes can have ordering and side-effect rules. A write to a doorbell register may cause hardware to fetch descriptors. A read from a status register may acknowledge or observe device state. That means register access should follow the hardware spec and Linux accessor rules.
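The accessor discipline is something I already follow in firmware, so here is the pattern in portable user-space C against a fake register block. The register offsets, bit names, and the reg_read/reg_write/reg_update helpers are hypothetical. The kernel's readl() and writel() do more than this sketch: as I understand them they also handle little-endian conversion and architecture-specific ordering, which plain volatile does not.

```c
#include <stdint.h>

/* Hypothetical register map for an imaginary device. */
#define REG_CTRL      0x00
#define REG_IRQ_MASK  0x04
#define CTRL_RESET    (1u << 0)
#define CTRL_RX_EN    (1u << 1)

/* Every access goes through one helper, so width and ordering rules
 * live in one place instead of being scattered pointer dereferences. */
static inline uint32_t reg_read(volatile void *base, uint32_t off)
{
    return *(volatile uint32_t *)((volatile uint8_t *)base + off);
}

static inline void reg_write(volatile void *base, uint32_t off, uint32_t val)
{
    *(volatile uint32_t *)((volatile uint8_t *)base + off) = val;
}

/* Read-modify-write helper: clear some bits and set others in one register. */
static void reg_update(volatile void *base, uint32_t off,
                       uint32_t clear, uint32_t set)
{
    uint32_t v = reg_read(base, off);

    v = (v & ~clear) | set;
    reg_write(base, off, v);
}
```

Read-modify-write is only safe on registers without read side effects or write-one-to-clear bits, which is exactly the kind of detail I would confirm in the device spec before using a helper like this.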
This is one of the strongest bridges from my background. I have not shipped a PCI NIC driver, but I have spent a lot of time with hardware registers, bit fields, firmware-visible state machines, and bugs caused by subtle sequencing between C code and hardware behavior.
- Why use `readl()`/`writel()` instead of dereferencing a pointer?
- What information lives in PCI capabilities?
- What kind of registers might live behind a NIC BAR?
How do MSI-X vectors, per-queue IRQs, and affinity work?
For a modern multi-queue NIC, MSI-X lets the device raise multiple independent interrupt vectors instead of one shared legacy interrupt. A common design is one vector per RX/TX queue pair, plus maybe a separate admin or event vector. Each vector has an IRQ handler that schedules the NAPI instance for that queue.
Per-queue vectors reduce lock contention and improve locality. If queue 3 receives traffic, only queue 3's interrupt and NAPI poller need to run. IRQ affinity can place that vector on a CPU close to the application or aligned with RSS/XPS/RPS policy. For low-latency NICs, this becomes very practical: CPU placement, queue selection, and interrupt moderation affect tail latency.
The driver setup usually allocates vectors with pci_alloc_irq_vectors() using PCI_IRQ_MSIX, requests IRQs per vector, stores the mapping from vector to queue, and frees them on teardown.
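One detail I noticed while tracing drivers is that pci_alloc_irq_vectors() can grant fewer vectors than the maximum requested, so the driver has to size its queue count from the result. The sketch below is a user-space model of that decision. vec_plan, plan_vectors, and queue_for_vector are hypothetical names, and the policy of reserving the last vector for admin events is one plausible layout, not a rule.

```c
struct vec_plan {
    int nqueues;     /* queue pairs that get a dedicated vector */
    int admin_vec;   /* index of the admin/event vector, -1 if shared */
};

/* Size the queue count from the vectors actually granted. */
static struct vec_plan plan_vectors(int granted, int wanted_queues)
{
    struct vec_plan p = { 0, -1 };

    if (granted <= 0 || wanted_queues <= 0)
        return p;
    if (granted == 1) {
        p.nqueues = 1;   /* everything shares vector 0 */
        return p;
    }
    /* Keep one vector back for admin events, give the rest to queues. */
    p.nqueues = wanted_queues < granted - 1 ? wanted_queues : granted - 1;
    p.admin_vec = p.nqueues;
    return p;
}

/* Map a vector index back to its queue, or -1 for non-queue vectors. */
static int queue_for_vector(const struct vec_plan *p, int vec)
{
    return (vec >= 0 && vec < p->nqueues) ? vec : -1;
}
```

Asking for 8 queues but being granted 4 vectors yields 3 queue vectors plus one admin vector, and the netdev's real queue count would then need to be reduced to match.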
I would be transparent that I have studied this in Linux NICs rather than shipped it. The bridge is that I have debugged interrupt-driven embedded systems and PHY/DSP paths where CPU placement was different, but the core issue was the same: map hardware events to the right software context with minimal latency and clear ownership.
- Why is one interrupt for the whole NIC less scalable?
- How does RSS relate to RX queue interrupts?
- What happens if interrupt moderation is too aggressive?
What practical ordering issues exist between PCIe, DMA, and MMIO?
At a practical driver level, I think in terms of visibility and ownership. Before ringing a TX doorbell, descriptors and packet data must be visible to the device. After observing a completion, the CPU must not read stale descriptor or packet data. Before freeing or reusing a buffer, the driver must know the device is done with it.
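For streaming DMA, the driver maps the buffer before handing it to hardware and unmaps it after completion. For coherent descriptor rings, the memory is coherent but ordering still matters around producer/consumer indexes and MMIO doorbells. The classic pattern is a dma_wmb() between writing the descriptor body and writing the field that hands ownership to the device, so the device never sees a valid flag on a half-written descriptor. writel() itself orders prior coherent-memory writes before the doorbell write; writel_relaxed() does not, so it needs an explicit wmb() first.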
MMIO accessors have ordering semantics, but I would not rely on folklore. I would check the device spec and Linux DMA API rules. On weakly ordered architectures, code that accidentally works on x86 may be wrong elsewhere.
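The closest portable analogy I know for the publish-then-notify rule is release/acquire ordering in C11 atomics. The sketch below is a CPU-to-CPU model only: the desc structure and the desc_publish/desc_consume helpers are hypothetical, and a real driver would use the kernel's dma_wmb()/dma_rmb() and MMIO accessors rather than stdatomic.h. It also shows why volatile is not the answer: volatile stops the compiler from eliding an access, but it does not order the body writes against the flag write.

```c
#include <stdatomic.h>
#include <stdint.h>

struct desc {
    uint64_t addr;
    uint32_t len;
    _Atomic uint32_t owned_by_dev;   /* written last by the producer */
};

/* Producer: fill the body first, then publish with release ordering. */
static void desc_publish(struct desc *d, uint64_t addr, uint32_t len)
{
    d->addr = addr;
    d->len = len;
    atomic_store_explicit(&d->owned_by_dev, 1, memory_order_release);
}

/* Consumer: acquire the flag before trusting the body, then hand back. */
static int desc_consume(struct desc *d, uint64_t *addr, uint32_t *len)
{
    if (!atomic_load_explicit(&d->owned_by_dev, memory_order_acquire))
        return 0;
    *addr = d->addr;
    *len = d->len;
    atomic_store_explicit(&d->owned_by_dev, 0, memory_order_release);
    return 1;
}
```

The release store plays the role of the write barrier before the ownership flag, and the acquire load plays the role of the read barrier after observing a completion.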
This is very close to my embedded experience. In wireless firmware and HW/SW integration, many bugs are not pure algorithm bugs; they are ordering, ownership, and visibility bugs between cores, DSP, DMA-like engines, and registers. I can reason about that class of failure, while being clear that Linux PCIe DMA API details are an area I have studied rather than owned in production.
- Why is `volatile` not the right answer for DMA ordering?
- What is the difference between coherent and streaming DMA memory?
- Where would you put a barrier in TX?
What happens during PCIe AER or PCI error handling?
AER is PCIe Advanced Error Reporting. The root port infrastructure detects and reports errors, and endpoint drivers can participate in recovery through PCI error handler callbacks. I would not claim production experience here; my understanding is from studying the kernel model and tracing driver patterns.
The driver may observe callbacks such as error_detected, mmio_enabled, slot_reset, and resume, depending on the recovery flow and severity. The driver's job is to stop trusting the device, quiesce IO, stop DMA if possible, report whether it can recover, reinitialize after reset, restore PCI/device state, and restart queues only when the PCI core indicates it is safe.
    static pci_ers_result_t mynic_error_detected(struct pci_dev *pdev,
                                                 pci_channel_state_t state)
    {
        struct net_device *ndev = pci_get_drvdata(pdev);

        /* Tell the stack the device is gone before touching anything else. */
        netif_device_detach(ndev);
        if (state == pci_channel_io_perm_failure)
            return PCI_ERS_RESULT_DISCONNECT;

        mynic_quiesce(ndev);
        /* Balance the enable so slot_reset can really re-enable the function. */
        pci_disable_device(pdev);
        return PCI_ERS_RESULT_NEED_RESET;
    }

    static pci_ers_result_t mynic_slot_reset(struct pci_dev *pdev)
    {
        struct net_device *ndev = pci_get_drvdata(pdev);

        if (pci_enable_device_mem(pdev))
            return PCI_ERS_RESULT_DISCONNECT;
        pci_set_master(pdev);
        /* No-op unless probe() saved config space with pci_save_state(). */
        pci_restore_state(pdev);
        mynic_reinit(ndev);
        return PCI_ERS_RESULT_RECOVERED;
    }
The bridge to my actual work is failure recovery across hardware and software boundaries. I have debugged MCU-DSP issues and many integration defects where the first task is to stop the bleeding, preserve evidence, rebuild known state, and avoid corrupting shared memory. That mindset applies directly, even though the Linux AER callback API is not a shipped experience for me.
- What is `netif_device_detach()` used for?
- Why should the driver stop DMA during error recovery?
- What does permanent failure mean for a NIC?
How would you debug a NIC that probes but cannot pass traffic?
I would split it into layers rather than guessing. First, confirm PCI discovery and resources: lspci -vv, driver bound, BARs present, MSI-X enabled, no obvious AER spam in dmesg. Then confirm the netdev exists and is administratively up with ip link, carrier state, and ethtool link reporting.
Next I would inspect the driver lifecycle: did ndo_open() run, are IRQs requested, are NAPI instances enabled, are RX rings refilled, are TX queues awake, and are there errors in stats? I would compare software counters with hardware counters: TX packets submitted, TX completions received, RX descriptors posted, RX completions observed, drops, DMA mapping failures, and interrupt counts in /proc/interrupts.
For TX-only failure, I would look at descriptors, DMA mapping, doorbell writes, queue stop/wake behavior, and completions. For RX-only failure, I would check link/MAC filters, RX buffer refill, interrupt/NAPI scheduling, and whether packets are dropped before reaching the stack.
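The layered triage can be written down as a decision function over counter deltas. This is a user-space sketch of my debugging order, not a tool that exists. Every name here (nic_deltas, triage, classify) is hypothetical, napi_polls assumes a debug counter the driver may not have, and the deltas are assumed to be sampled while test traffic is running in both directions.

```c
/* Counter deltas over a sampling interval; all field names are hypothetical. */
struct nic_deltas {
    unsigned long irqs;          /* per-queue count from /proc/interrupts */
    unsigned long napi_polls;    /* driver debug counter, if one exists */
    unsigned long rx_posted;     /* RX descriptors handed to hardware */
    unsigned long rx_completed;
    unsigned long tx_submitted;
    unsigned long tx_completed;
};

enum triage {
    TRIAGE_NO_IRQ,            /* look at MSI-X setup, masks, affinity */
    TRIAGE_IRQ_BUT_NO_POLL,   /* look at napi_enable and scheduling */
    TRIAGE_RX_RING_EMPTY,     /* look at refill and buffer allocation */
    TRIAGE_TX_NO_COMPLETION,  /* look at descriptors, doorbell, DMA */
    TRIAGE_DATAPATH_MOVING,   /* look higher: filters, link, stack */
};

/* Check the layers in the order an event would travel through them. */
static enum triage classify(const struct nic_deltas *d)
{
    if (d->irqs == 0)
        return TRIAGE_NO_IRQ;
    if (d->napi_polls == 0)
        return TRIAGE_IRQ_BUT_NO_POLL;
    if (d->rx_posted == 0 && d->rx_completed == 0)
        return TRIAGE_RX_RING_EMPTY;
    if (d->tx_submitted > 0 && d->tx_completed == 0)
        return TRIAGE_TX_NO_COMPLETION;
    return TRIAGE_DATAPATH_MOVING;
}
```

The value is less in the code than in the ordering: each check rules out one layer before the next one is blamed, which is the same discipline I used on firmware defects.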
This is where I would lean on my real strength. At MediaTek I worked through many hardware/software integration issues, including firmware and DSP paths, and closed large numbers of defects by moving systematically across logs, counters, register state, and code paths. I would apply that method here while being clear that I am ramping on Ethernet-specific tools and driver internals.
- What would you check in `/proc/interrupts`?
- How do you distinguish no IRQ from IRQ but no NAPI progress?
- Which counters would you add if the existing stats were insufficient?
What do you know well here, and what would you need to ramp on?
I would be direct: I have not shipped a Linux NIC driver, and my protocol depth is stronger in wireless PHY than Ethernet and TCP/IP. What I can offer immediately is solid embedded C, register-level reasoning, interrupt and firmware pipeline debugging, spec-to-code translation, and experience tracing hard HW/SW bugs through real products.
For the Linux NIC lifecycle, I have studied the main shape: PCI probe() and remove, BAR/MMIO setup, DMA masks and descriptor rings, net_device_ops, ndo_open() and ndo_stop(), TX through ndo_start_xmit(), RX through NAPI, MSI-X queues, ethtool reporting, reset/recovery, and PCIe AER handling. I can reason about these paths and read existing drivers, but I would not present that as shipped production ownership.
My ramp plan would be practical:
- Trace one AMD/Solarflare-style driver path from PCI ID match to register_netdev().
- Trace one TX packet from ndo_start_xmit() to completion cleanup.
- Trace one RX packet from interrupt to NAPI to GRO/stack handoff.
- Build small experiments around DMA mapping, queue stop/wake, NAPI budget behavior, and ethtool stats.
- Pair that with protocol ramp on Ethernet link, RSS, checksum/TSO/GRO, TCP basics, and low-latency tuning.
That is the honest story I would tell in a first screen: I am not hiding the gap, but the gap is learnable because the underlying HW/SW discipline is exactly the kind of work I have done.
- Which existing driver would you read first and why?
- What Linux networking concept is highest priority for your ramp?
- How would your wireless PHY experience help on NIC driver work?