🔎 Debugging & Observability

A NIC-debugging round for a wireless-PHY embedded engineer pivoting into AMD networking: systematic packet-drop, latency, link, DMA, and datapath triage without overstating shipped Linux NIC-driver experience.

Walk the descriptor ring the way you'd bisect a stuck queue.

[Interactive figure: RX descriptor ring with 8 descriptors (d0 to d7), initially all FREE; HEAD 0, TAIL 0; MMIO doorbell idle.]
Legend: HEAD is hardware-owned (the NIC consumes); TAIL is driver-owned (the driver posts); OWN 0 means handed to NIC; DD 1 means done, awaiting reap.
Barrier before the doorbell: the descriptor writes must be globally visible before the doorbell MMIO write, or the NIC can DMA a half-written descriptor. The driver issues a write barrier (wmb()), then the posted doorbell write. Completions are learned from the DMA'd OWN/DD bit in host memory, never by polling a NIC register.
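
The ownership handoff described above can be sketched as a toy model. This is plain Python, not driver code: the class and field names (`RxRing`, `handed_to_nic`, `done`, `reap`) are illustrative assumptions, the real bit encoding is device-specific, and Python has no memory barriers, so the barrier appears only as a comment at the point where a driver would need it.

```python
from dataclasses import dataclass, field

RING_SIZE = 8  # same size as the ring in the walkthrough


@dataclass
class Descriptor:
    addr: int = 0
    length: int = 0
    handed_to_nic: bool = False  # stands in for the OWN state
    done: bool = False           # stands in for the DD state


@dataclass
class RxRing:
    descs: list = field(
        default_factory=lambda: [Descriptor() for _ in range(RING_SIZE)])
    tail: int = 0      # driver-owned: next slot the driver posts
    head: int = 0      # hardware-owned: next slot the NIC consumes
    doorbell: int = 0  # last tail value published to the NIC
    reap: int = 0      # driver-private: next slot to reap

    def post(self, addr, length):
        """Driver: fill the descriptor completely, then mark it handed over."""
        nxt = (self.tail + 1) % RING_SIZE
        if nxt == self.reap:
            raise RuntimeError("ring full")  # one slot always stays empty
        d = self.descs[self.tail]
        d.addr, d.length = addr, length
        d.handed_to_nic = True
        self.tail = nxt

    def ring_doorbell(self):
        """Driver: publish the new tail (models the MMIO doorbell write).

        A real driver issues the write barrier before this write.
        """
        self.doorbell = self.tail

    def nic_consume(self):
        """NIC: complete every descriptor published by the last doorbell."""
        completed = 0
        while self.head != self.doorbell:
            d = self.descs[self.head]
            d.handed_to_nic, d.done = False, True
            self.head = (self.head + 1) % RING_SIZE
            completed += 1
        return completed

    def reap_completed(self):
        """Driver: learn completions from descriptor state only."""
        addrs = []
        while self.descs[self.reap].done:
            d = self.descs[self.reap]
            d.done = False
            addrs.append(d.addr)
            self.reap = (self.reap + 1) % RING_SIZE
        return addrs
```

In this model the NIC sees nothing until `ring_doorbell()` runs, and the driver learns about completions only by reading descriptor state, which is the same contract the walkthrough states.
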
A host starts dropping packets after a new low-latency UDP workload is deployed. Walk me through your first hour of debugging. (senior)

I would start by separating where packets disappear: on the wire, in the NIC, in the driver rings, in the Linux stack, or in the application. I have not shipped a Linux NIC driver, so I would be careful to say I am applying a studied Linux/NIC debug method, similar to how I have debugged MediaTek PHY/firmware issues from traces, core dumps, KPIs, and register-level symptoms.

My first pass would be evidence, not tuning:

  • ip -s link show dev eth0 for standard RX/TX errors, drops, overruns, and carrier changes.
  • ethtool -S eth0 for driver/NIC counters: RX no-buffer, ring full, missed packets, checksum errors, pause frames, FEC errors, queue drops, and per-queue imbalance.
  • ethtool -g eth0 for ring sizes and ethtool -c eth0 for coalescing, because a small ring plus microbursts can look like random loss.
  • dmesg -T for reset, firmware, PCIe AER, DMA, IOMMU, or link messages.
  • tcpdump -i eth0 -nn -s 128 -w repro.pcap if capture overhead is acceptable, or a SPAN/TAP capture if I need wire truth.
  • switch counters on the partner port, because a host-only view can miss ingress drops before the NIC.

Then I would reproduce with the smallest workload that still drops, pin it to known queues if possible, and compare counters before and after a fixed interval. The key is to take deltas, not stare at absolute values. If only one RX queue has drops, I look at RSS/IRQ affinity and queue mapping. If all queues drop with no wire errors, I look at ring refill, NAPI budget, softirq pressure, and application receive behavior.
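
A minimal Python sketch of the delta habit, assuming two saved `ethtool -S` snapshots in the usual `name: value` layout. The function names are hypothetical, and counter names differ per driver, so any names fed in are only what that driver happens to report.

```python
def parse_ethtool_stats(text):
    """Parse 'name: value' lines from an `ethtool -S` snapshot into ints."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.rpartition(":")
        value = value.strip()
        # Skip the "NIC statistics:" header and any non-numeric lines.
        if sep and value.lstrip("-").isdigit():
            stats[name.strip()] = int(value)
    return stats


def counter_deltas(before, after):
    """Return only the counters that changed between two snapshots."""
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] != before[name]
    }
```

Counters that did not move during the repro window drop out of the result, which keeps attention on the layer that actually changed.
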

The way I would say it in an interview is: I would not claim I have debugged this exact Ethernet driver in production, but I have debugged more than 100 HW/SW issues where the main skill was narrowing a noisy symptom to the exact layer that stopped honoring ownership, timing, or buffering.

What they're listening for: Strong answers gather host, driver, NIC, and switch evidence before changing knobs, and they avoid claiming shipped NIC-driver experience.
Follow-ups
  • Which counters would make you suspect RX ring starvation rather than wire loss?
  • Why are counter deltas more useful than absolute values?
  • When would you trust tcpdump, and when would you need an external capture?
Give me concrete commands you would run to debug packet drops on a Linux NIC. (senior)

I would use a repeatable snapshot so I can compare before and after the workload. A simple version is:

/* Not C code I would compile; this is the command checklist I would say out loud. */
$ ip -s link show dev eth0
$ ethtool -i eth0
$ ethtool -S eth0
$ ethtool -g eth0
$ ethtool -c eth0
$ grep eth0 /proc/interrupts
$ cat /proc/net/softnet_stat
$ dmesg -T
$ tcpdump -i eth0 -nn -s 128 'udp and port 4791'

In a real session I would replace eth0 and the filter with the actual interface and workload. I would also take the same commands immediately before and immediately after a 30-second repro window, then diff the counters.

What I am trying to learn is:

  • Driver identity and version from ethtool -i, so I know which counter names to interpret.
  • Whether drops are standard netdev drops or driver-private drops.
  • Whether one queue is hot from interrupts and per-queue stats.
  • Whether softnet backlog drops or time_squeeze are increasing, which points above the NIC into CPU/softirq budget pressure.
  • Whether the kernel logged DMA, IOMMU, PCIe, reset, or link events.
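
For the softnet check in the list above, a small sketch of reading `/proc/net/softnet_stat` snapshots. It relies only on the first three hexadecimal columns (processed, dropped, time_squeeze); later columns vary by kernel version and are ignored here. The function names are hypothetical, and a row is assumed to correspond to one CPU in file order.

```python
def parse_softnet_stat(text):
    """Parse the first three hex columns of each softnet_stat row."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        processed, dropped, time_squeeze = (int(f, 16) for f in fields[:3])
        rows.append({
            "processed": processed,
            "dropped": dropped,
            "time_squeeze": time_squeeze,
        })
    return rows


def softnet_deltas(before, after):
    """Row-by-row differences between two parsed snapshots."""
    return [{key: a[key] - b[key] for key in a} for b, a in zip(before, after)]
```

A row whose `dropped` or `time_squeeze` delta grows during the repro window points at backlog or softirq budget pressure on that CPU rather than at the wire.
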

I would be explicit that this is a studied Linux NIC observability workflow. My shipped experience is closer to embedded trace triage: I used core dumps, firmware traces, KPIs, and repro minimization to close modem/PHY issues, and I would transfer that discipline to Linux NIC counters and tracepoints.

What they're listening for: The candidate should know the command vocabulary and what each command proves. The code fence is intentionally a compact checklist, not a claim of driver authorship.
Follow-ups
  • What does `/proc/net/softnet_stat` add beyond `ethtool -S`?
  • Why is `ethtool -i` useful before interpreting counters?
  • How would you reduce tcpdump perturbation on a high-rate workload?
A low-latency service has p99 latency spikes, but average throughput is fine. How would you debug whether the NIC path is involved? (senior)

I would treat this as a timeline problem. A p99 spike means some stage occasionally waits too long, so I want timestamps and queue depth at each stage: application, socket, softirq/NAPI, driver ring, DMA/NIC, and wire.

I would check:

  • IRQ affinity and whether RX/TX queue interrupts land on the same NUMA node as the application.
  • Interrupt moderation with ethtool -c; coalescing can improve throughput while adding microseconds to tail latency.
  • Per-queue packet distribution from ethtool -S, RSS indirection from ethtool -x, and queue mapping with /proc/interrupts.
  • CPU isolation, frequency governor, C-states, and whether ksoftirqd is running on the same core as the app.
  • perf top or perf record around spikes, looking for softirq, driver poll, scheduler, copy, checksum, or lock contention.
  • ftrace/trace-cmd events around IRQ, NAPI poll, sched switch, and net receive if perf is too coarse.
  • Busy-poll options only after I understand CPU budget, because busy polling can reduce wakeup latency but burns cores and can hurt neighbors.

The mental model I would say out loud is: throughput says the system can drain the average load; p99 asks whether a queue, interrupt, scheduler event, or cache/NUMA transition sometimes delays a packet. That is very close to my MediaTek debugging style: in PHY/firmware issues, I learned to correlate KPI spikes with exact trace windows rather than tune blindly.
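
The coalescing tradeoff mentioned above can be illustrated with a deliberately simplified timer model. It assumes one fixed delay (`rx_usecs`) that starts when the first pending packet arrives; real NICs also apply frame-count thresholds and adaptive moderation, which this sketch ignores. The function name is hypothetical.

```python
def coalesce(arrivals_us, rx_usecs):
    """Return (interrupt_count, per-packet delay) for a fixed coalescing timer.

    A packet that arrives with no timer pending starts one; every packet
    that arrives before the timer fires is delivered when it fires.
    """
    delays = []
    interrupts = 0
    fire_at = None
    for t in sorted(arrivals_us):
        if fire_at is None or t > fire_at:
            fire_at = t + rx_usecs  # start a new timer for this packet
            interrupts += 1
        delays.append(fire_at - t)
    return interrupts, delays
```

With arrivals at 0, 10, 20, 30, and 100 µs, `rx_usecs=0` gives five interrupts and no added delay, while `rx_usecs=50` gives two interrupts but a worst-case added delay of 50 µs. Fewer interrupts help throughput; the added wait shows up in the tail.
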

What they're listening for: Strong answers connect latency spikes to queueing and scheduling, and mention coalescing, IRQ affinity, CPU isolation, busy polling, perf, ftrace, softirq load, and queue mapping.
Follow-ups
  • How can interrupt coalescing improve throughput while hurting p99?
  • What would make you suspect `ksoftirqd` rather than the NIC?
  • When is busy polling a good latency tradeoff?
What would you look at if latency spikes correlate with `NET_RX` softirq load? (senior)

I would first confirm whether the RX path is exceeding its CPU or budget. I have studied NAPI as the mechanism where an interrupt schedules polling, and packet work runs in softirq context or in a NAPI thread depending on configuration. I would not present myself as someone who implemented this in a production NIC, but I can reason about the failure modes.

I would inspect:

  • /proc/net/softnet_stat per CPU, especially growth in the dropped and time_squeeze columns.
  • /proc/interrupts to see whether one MSI-X vector is overloaded.
  • ethtool -S per-queue packet and drop counters, looking for one hot queue.
  • RSS indirection and hash behavior with ethtool -x and flow steering rules if available.
  • RPS/RFS/XPS settings under /sys/class/net/eth0/queues/*, because software steering can help or make cache movement worse.
  • perf top -g during the spike, looking for NAPI poll, GRO, skb allocation, copy, checksum, or driver functions.
  • ftrace or trace-cmd around napi_poll, IRQ entry/exit, and scheduler switches if I need ordering.

The main decision is whether the fix belongs in hardware steering, CPU placement, queue count, coalescing, NAPI budget, application receive behavior, or workload shaping. I would avoid broad changes like enabling RPS on all CPUs until I know the cost, because moving packets across CPUs can add IPIs and cache misses.
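
To make the "one overloaded vector" check concrete, here is a sketch that compares two `/proc/interrupts` snapshots. It assumes the usual layout (a header row of CPU columns, then one row per IRQ with per-CPU counts and the vector name last). The function names are hypothetical, and vector naming such as `eth0-TxRx-0` is driver-specific.

```python
def irq_counts(text, match):
    """Map vector name -> per-CPU counts for rows containing `match`."""
    lines = text.splitlines()
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    result = {}
    for line in lines[1:]:
        tokens = line.split()
        if match not in line or len(tokens) < ncpu + 2:
            continue
        counts = tokens[1:1 + ncpu]
        if all(c.isdigit() for c in counts):
            result[tokens[-1]] = [int(c) for c in counts]
    return result


def hottest_vector(before, after):
    """Return (vector name, share of all new interrupts) over the window."""
    deltas = {
        name: sum(after[name]) - sum(before.get(name, [0]))
        for name in after
    }
    total = sum(deltas.values())
    name = max(deltas, key=deltas.get)
    return name, (deltas[name] / total if total else 0.0)
```

A share close to 1.0 for a single vector is a prompt to look at RSS and affinity before touching RPS or coalescing.
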

What they're listening for: This tests whether the candidate knows softirq/NAPI observability and does not treat RPS as free parallelism.
Follow-ups
  • Why can RPS make a low-latency workload worse?
  • What does one hot RX queue suggest?
  • How would you prove a spike is scheduler-related?
A link is flapping or comes up at the wrong speed. How would you debug link issues without jumping straight to driver blame? (senior)

I would start with the physical and link-negotiation layer, because many link problems are outside the driver. My first checks would be link state, negotiated speed/duplex, autonegotiation, FEC mode, optics or cable identity, and the switch-port view.

I would use:

  • ethtool eth0 for link detected, speed, duplex, autonegotiation, advertised modes, and partner-advertised modes.
  • ethtool -m eth0 for optics diagnostics when the module supports it, such as temperature, power, and vendor fields.
  • ethtool -S eth0 for physical-layer style counters: CRC/FCS errors, symbol errors, alignment errors, PCS/FEC counters, local/remote faults, and link resets.
  • Switch counters and logs for the connected port, because the partner can tell a different story.
  • dmesg -T for link up/down, firmware, module, and reset messages.

If autonegotiation or FEC is suspected, I would compare both sides explicitly rather than forcing settings blindly. On high-speed Ethernet, FEC mismatch or marginal optics can present as intermittent packet loss or latency before a clean link-down event. If counters show physical errors on both host and switch, I would swap cable/optic/port as a controlled experiment.
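
Comparing both sides explicitly can be as simple as the sketch below. It parses a few fields from saved `ethtool eth0` output and diffs them against the partner's values, which are assumed to have been copied by hand from the switch CLI into the same key names. The function names and the chosen key set are illustrative assumptions.

```python
LINK_KEYS = ("Speed", "Duplex", "Auto-negotiation", "Link detected")


def link_summary(ethtool_text):
    """Extract selected 'Key: value' fields from `ethtool <iface>` output."""
    summary = {}
    for line in ethtool_text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key in LINK_KEYS:
            summary[key] = value.strip()
    return summary


def mismatches(host, partner):
    """Return {key: (host value, partner value)} where the two sides differ."""
    return {
        key: (host[key], partner[key])
        for key in host
        if key in partner and host[key] != partner[key]
    }
```

An empty result does not prove the link is healthy; it only rules out a configuration disagreement, so the error counters and controlled swaps still apply.
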

My interview framing would be: I have deeper shipped experience in wireless PHY than Ethernet PHY, so I would not pretend I know every Ethernet PCS/FEC detail from production. But the debugging posture is familiar: link training, negotiated capability, error counters, partner view, and controlled swaps are the same kind of HW/SW boundary reasoning I used in modem work.

What they're listening for: A strong answer checks link partner, optics/cable, autoneg/FEC, and error counters before blaming software.
Follow-ups
  • What counters would distinguish a bad cable from RX ring pressure?
  • Why does the switch-port view matter?
  • When is forcing speed/FEC useful, and when is it dangerous?
How would you investigate suspected DMA or IOMMU faults in a NIC driver path? (senior)

I would treat DMA faults as an ownership and address-translation problem. The NIC must DMA to a DMA address returned by the Linux DMA API, not a CPU virtual address and not a guessed physical address. With an IOMMU, that DMA address is an IOVA, so bad lifetime, length, direction, or unmap behavior can show up as IOMMU faults.

I would check:

  • dmesg -T for IOMMU fault logs, DMA API debug warnings, PCIe AER, device reset, or invalid descriptor messages.
  • Whether the driver maps every TX fragment and RX buffer with the correct direction and length.
  • Whether mappings are unmapped exactly once, after the device is done.
  • Whether descriptor addresses are dma_addr_t values returned by dma_map_* or dma_alloc_coherent, not derived from pointers.
  • Whether a ring wrap, jumbo frame, scatter-gather segment, or XDP/AF_XDP redirect path changes buffer lifetime.
  • Whether memory barriers are present before publishing descriptors and ringing the doorbell.

The kind of bug I would be looking for is a stale descriptor, early buffer recycle, wrong DMA length, unmap-before-completion, double-unmap, or a 64-bit DMA address truncated into a smaller descriptor field. I would also compare with IOMMU on/off only as an isolation step, not as a fix, because disabling IOMMU can hide a real driver ownership bug.
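
The lifetime bugs listed above can be expressed as a small bookkeeping model. This is a toy in Python, not the kernel DMA API: `DmaLifetimeChecker`, its method names, and the violation labels are all hypothetical, chosen only to show which state transition each bug breaks.

```python
class DmaLifetimeChecker:
    """Tracks map -> device done -> unmap for each DMA address."""

    def __init__(self):
        self.live = {}        # dma_addr -> {"length", "device_done"}
        self.violations = []  # (label, dma_addr) tuples

    def map(self, dma_addr, length):
        self.live[dma_addr] = {"length": length, "device_done": False}

    def device_done(self, dma_addr, bytes_written):
        mapping = self.live.get(dma_addr)
        if mapping is None:
            # The device touched a buffer that is no longer mapped.
            self.violations.append(("dma-after-unmap", dma_addr))
            return
        if bytes_written > mapping["length"]:
            self.violations.append(("length-overrun", dma_addr))
        mapping["device_done"] = True

    def unmap(self, dma_addr):
        mapping = self.live.pop(dma_addr, None)
        if mapping is None:
            # Covers a second unmap and an unmap of a never-mapped address.
            self.violations.append(("double-unmap", dma_addr))
        elif not mapping["device_done"]:
            self.violations.append(("unmap-before-completion", dma_addr))


def fits_descriptor_field(dma_addr, field_bits):
    """True if the address fits the descriptor's address field untruncated."""
    return 0 <= dma_addr < (1 << field_bits)
```

Reading driver code with these four transitions in mind makes it easier to ask, for each path, which call proves the device is finished before the buffer is recycled.
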

This maps well to my embedded-C experience: in one MediaTek MCU-DSP style issue, the important skill was not knowing the final answer upfront; it was proving which side owned the buffer at each point and finding the transition that violated the contract.

What they're listening for: The candidate should know `dma_addr_t`, mapping lifetime, direction, descriptor sanity, dmesg/IOMMU logs, and not treat disabling IOMMU as the solution.
Follow-ups
  • Why is an IOVA not the same thing as a physical address?
  • What does double-unmap or unmap-too-early look like?
  • How would jumbo frames expose a DMA length bug?
Explain a structured datapath-triage method for a packet that the application says it sent but the peer never received. (senior)

I would walk the packet as a set of ownership handoffs and ask where evidence stops. For TX, the layers are application, socket send buffer, TCP/UDP/IP stack, qdisc, driver ndo_start_xmit, TX ring, DMA, NIC, wire, switch, and peer RX. For RX, I reverse it: wire, NIC, RX ring/DMA, NAPI/driver, stack, socket, application.

My triage order would be:

  • Application: confirm the send call returned success and payload/peer address are correct.
  • Socket/stack: check ss -i, retransmits, UDP drops, qdisc stats with tc -s qdisc, and route/ARP/neighbor state.
  • Driver/ring: use ethtool -S for TX queue packets, drops, timeouts, and completion counters.
  • DMA/NIC: look for PCIe, DMA, IOMMU, firmware, and reset messages in dmesg.
  • Wire: compare host tcpdump with switch SPAN/TAP if needed, remembering that offloads can change what host captures show.
  • Peer: check the receiving host and switch counters before assuming transmit-side loss.

I would state uncertainty directly: at this point I know the application handed data to the kernel, but I do not yet know whether it reached the wire. Or: the switch saw the frame, so I can stop spending time in the local driver and move to the peer side. That is the discipline I used in MediaTek issue triage too: narrow the layer, state what is proven, then choose the next measurement.
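
The "where does evidence stop" habit can be written down as a tiny helper. The layer list mirrors the TX handoffs named in the answer; the function name and the observation encoding (True for seen, False for proven absent, missing for not yet measured) are assumptions made for this sketch.

```python
TX_PATH = [
    "application", "socket", "stack", "qdisc", "driver",
    "tx_ring", "dma", "nic", "wire", "switch", "peer_rx",
]


def triage(observations):
    """Summarize what is proven and which layers to measure next.

    Seeing the packet at a layer proves every earlier layer passed it on.
    """
    seen = [i for i, layer in enumerate(TX_PATH)
            if observations.get(layer) is True]
    last_seen = max(seen) if seen else -1
    absent = [i for i, layer in enumerate(TX_PATH)
              if observations.get(layer) is False and i > last_seen]
    first_absent = min(absent) if absent else len(TX_PATH)
    return {
        "proven_up_to": TX_PATH[last_seen] if seen else None,
        "first_absent": TX_PATH[first_absent] if absent else None,
        "measure_next": [
            TX_PATH[i] for i in range(last_seen + 1, first_absent)
            if observations.get(TX_PATH[i]) is None
        ],
    }
```

Once the switch reports the frame, `measure_next` becomes empty for the local host, which matches the point about no longer spending time in the local driver.
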

What they're listening for: This tests layer-by-layer reasoning and communication of uncertainty while narrowing the search.
Follow-ups
  • Where can host tcpdump mislead you on the TX path?
  • What qdisc symptoms point above the driver?
  • How would you phrase uncertainty to a hiring manager?
How would you use ftrace or perf to make NIC latency debugging less speculative? (senior)

I would use perf first when I need statistical hotspots, and ftrace/trace-cmd when I need ordered events. perf top can show whether CPU time goes into the driver poll function, GRO, skb allocation, checksum, copy, locks, or scheduler overhead. ftrace can show when interrupts, NAPI polls, and scheduler switches happen around the spike.

A lightweight command sequence I would try in a lab is:

/* Command checklist, not compilable C. */
$ perf top -g --sort comm,dso,symbol
$ perf record -a -g -- sleep 30
$ perf report
$ trace-cmd record -e irq -e napi -e net -e sched_switch sleep 10
$ trace-cmd report

Then I would correlate trace time with application p99 events or packet timestamps. I am looking for patterns like IRQ handled quickly but NAPI delayed, NAPI poll runs but budget is exhausted, scheduler moves the app away from its queue CPU, or the driver completes TX late because coalescing delayed completions.

I would be careful with overhead. If the workload is very low-latency, full tracing can perturb it, so I would start with a short, broad capture and then narrow the event filters. This is similar to firmware tracing in my previous role: the trace must answer a precise question, otherwise it becomes noise and changes the timing of the bug.
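
One way to turn the "budget is exhausted" pattern into something countable is to scan saved `trace-cmd report` text for `napi_poll` events. The sketch assumes each such line carries a timestamp followed by `napi_poll:` and the fields `device <name> work <n> budget <m>`; the exact line layout depends on kernel and trace-cmd versions, so the regular expression is an assumption to adjust against real output. The function names are hypothetical.

```python
import re

# Assumed layout: "<task> [cpu] <ts>: napi_poll: ... device eth0 work 64 budget 64"
NAPI_RE = re.compile(
    r"(\d+\.\d+):\s+napi_poll:.*\bdevice (\S+) work (\d+) budget (\d+)")


def napi_polls(report_text):
    """Return (timestamp, device, work, budget) for each napi_poll line."""
    polls = []
    for line in report_text.splitlines():
        m = NAPI_RE.search(line)
        if m:
            ts, dev, work, budget = m.groups()
            polls.append((float(ts), dev, int(work), int(budget)))
    return polls


def budget_exhausted(polls, dev):
    """Polls on `dev` that used their whole budget (work >= budget)."""
    return [p for p in polls if p[1] == dev and p[2] >= p[3]]
```

A run of consecutive full-budget polls just before an application p99 event is a stronger hint of RX backlog than a single one, so the timestamps are kept for correlation.
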

What they're listening for: Strong answers distinguish sampling from ordered tracing and mention overhead and correlation with p99 events.
Follow-ups
  • When would perf be insufficient and ftrace be better?
  • What trace pattern indicates NAPI budget pressure?
  • How can tracing perturb a low-latency workload?
A single flow is hot and one RX queue drops while other queues are idle. What would you check? (senior)

I would first verify that it is truly one flow or one hash bucket by looking at per-queue counters, interrupt counts, and RSS configuration. RSS spreads flows, not packets from one ordered flow, so a single elephant flow can legitimately land on one queue. I would not assume the NIC is broken just because the queues are imbalanced.

I would check:

  • ethtool -S eth0 for per-RX-queue packets, drops, no-buffer events, and IRQ counts.
  • /proc/interrupts for whether one vector is much hotter.
  • ethtool -x eth0 for the RSS indirection table.
  • ethtool -n eth0 for flow classification rules if the driver supports them.
  • Whether encapsulation, VLAN, or tunnel headers are included in the RSS hash.
  • Application CPU placement and socket affinity.

Possible responses are workload sharding, changing the RSS indirection table, adding hardware flow rules, using application-level parallelism, or accepting that one ordered TCP/UDP flow cannot be split across queues without reordering risk. I would be cautious with RPS: it can move later processing to another CPU, but the original queue still paid the interrupt and DMA-completion cost, and the extra IPI/cache movement may hurt p99.

The concise interview answer is: first prove the imbalance, then decide whether it is expected hash behavior, a missing tunnel hash capability, a bad affinity setup, or a real ring-refill/driver issue.
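
Why one flow stays on one queue can be shown with a toy hash-and-indirection model. CRC32 is used here purely as a stand-in: real NICs typically use a Toeplitz hash with a configurable key, and the tuple layout, table size, and function names below are illustrative assumptions.

```python
import zlib


def rss_bucket(flow, table_size):
    """Hash a (src_ip, src_port, dst_ip, dst_port) tuple to a table index."""
    src_ip, src_port, dst_ip, dst_port = flow
    key = f"{src_ip}:{src_port}>{dst_ip}:{dst_port}".encode()
    return zlib.crc32(key) % table_size  # stand-in hash, not Toeplitz


def rss_queue(flow, indirection):
    """Look up the RX queue for a flow through the indirection table."""
    return indirection[rss_bucket(flow, len(indirection))]


if __name__ == "__main__":
    table = [i % 4 for i in range(128)]  # 128 entries spread over 4 queues
    elephant = ("10.0.0.1", 40000, "10.0.0.2", 4791)
    # Every packet of the flow carries the same tuple, so the queue never changes.
    print({rss_queue(elephant, table) for _ in range(1000)})
```

Every packet of the flow hashes to the same table entry, so adding queues does not help that flow. Rewriting that one entry moves the flow to another queue, but it still cannot be split across queues.
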

What they're listening for: The candidate should understand RSS, queue mapping, flow ordering, and why software steering is not free.
Follow-ups
  • Why can one elephant flow defeat RSS scaling?
  • How can encapsulation break expected RSS distribution?
  • Why might RPS improve throughput but hurt latency?
How would you communicate uncertainty during a NIC incident while still making progress? (senior)

I would separate facts, hypotheses, and next measurements. In an interview I would say something like: I have not shipped this NIC driver, so I am not going to guess the private counter meaning. What I can do is use the public Linux tools and the driver docs/source to narrow the layer.

A good incident update from me would sound like:

  • We know the application send path is succeeding because send returns success and socket errors are clean.
  • We do not yet know whether frames reach the wire.
  • Host TX packet counters increase, but switch ingress counters do not, so the current suspect layer is driver/NIC/physical link.
  • Next I will compare TX completion counters, dmesg reset/AER logs, and a TAP or switch SPAN capture.
  • If switch ingress counters do increase, I will stop investigating local TX and move to the switch/peer RX side.

That style matters because it prevents overclaiming. My strongest evidence from MediaTek is not that I already know Ethernet driver internals end to end; it is that I have repeatedly taken ambiguous HW/SW failures, built a trace/KPI/core-dump picture, and closed them by narrowing the boundary. For AMD networking, I would bring the same discipline and be transparent about which parts I studied versus shipped.

What they're listening for: This answer should sound honest and senior: precise uncertainty, no bluffing, clear next measurement.
Follow-ups
  • Give an example of a fact versus a hypothesis in this incident.
  • How do you avoid wasting time once a layer is ruled out?
  • How would you answer if an interviewer asks whether you have shipped a Linux NIC driver?
What driver or kernel-source details would you inspect after the first round of counters points to the NIC driver? (senior)

If counters point into the driver, I would read the driver source around the exact queue path instead of guessing from generic Linux knowledge. I would look at the ethtool stats definitions first, because private counter names only mean what the driver says they mean. Then I would inspect RX refill, TX completion, NAPI poll, IRQ handler, ring resize, coalescing, reset, and DMA map/unmap paths.

Specific questions I would ask in code are:

  • Which counter increments at the observed drop point?
  • Does RX refill fail because allocation fails, DMA mapping fails, or the page pool/AF_XDP fill ring is empty?
  • Does the NAPI poll respect budget and only re-enable IRQs after proper completion?
  • Are TX queues stopped and woken with enough descriptor headroom?
  • Are descriptor writes ordered before ownership bits and MMIO doorbells?
  • Are DMA mappings created, synced, and unmapped with the correct lifetime?
  • Does reset quiesce DMA and NAPI before freeing rings?

I would connect this to my embedded-C background: I am comfortable reading C at the HW/SW boundary, tracing state machines, and checking ownership transitions. The gap is that I would be new to this exact Ethernet driver stack, so I would lean on kernel docs, driver source, and careful measurements rather than pretending I have years of private driver intuition.

What they're listening for: Strong answers move from counters to source-code ownership points and stay explicit and honest about Mohamed's experience gap.
Follow-ups
  • Why should you read the driver's stat definitions before interpreting `ethtool -S`?
  • Which paths are most likely to cause drops under bursty RX load?
  • What source pattern would make you suspect a DMA ordering bug?