⚔️ Earn the Vocabulary: the follow-ups that catch a career-changer

The mechanism answers behind every NIC term you'll drop. Learn these cold so 'doorbell', 'Onload', 'kernel bypass' and 'descriptor ownership' are earned, not borrowed.

You list TCP/IP as a skill. Walk a TCP segment from send() to the wire.

Be honest first: I'd say my TCP/IP is foundations, not production. But here's the model.

When send() returns, the kernel has the bytes in an sk_buff; TCP segments them and adds its header (ports, seq/ack, window, checksum), IP adds src/dst, TTL, protocol and a header checksum, then L2 wraps it in an Ethernet frame: dst MAC, src MAC, EtherType, payload, and the FCS/CRC at the end.

The NIC then DMAs the frame out of host memory and serialises it onto the wire. Modern NICs offload work: TSO lets the host hand over one big buffer and the NIC splits it into MSS-sized segments and fills in per-segment checksums; the L4 checksum is usually offloaded, and the Ethernet FCS is generated by the NIC's MAC hardware, not by the host stack.
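A minimal sketch of that layering, assuming an IPv4/TCP segment with no options. The addresses, ports and MACs are made up for illustration, and `build_frame` is a hypothetical helper, not part of any stack. The IPv4 header checksum is filled in software; the FCS is appended with `zlib.crc32` as a stand-in for what NIC hardware would normally generate, since it uses the same CRC-32.

```python
import struct
import zlib


def ones_complement_sum(data: bytes) -> int:
    """16-bit one's-complement sum with end-around carry."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(word for (word,) in struct.iter_unpack("!H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def build_frame(payload: bytes) -> dict:
    # Illustrative values only: documentation IPs, locally administered MACs.
    src_ip, dst_ip = bytes([192, 0, 2, 1]), bytes([192, 0, 2, 2])
    src_mac = bytes.fromhex("020000000001")
    dst_mac = bytes.fromhex("020000000002")

    # TCP header (20 B): ports, seq, ack, data offset, flags, window,
    # checksum (left 0 in this sketch), urgent pointer.
    tcp = struct.pack("!HHIIBBHHH", 40000, 80, 1, 0, 5 << 4, 0x18, 65535, 0, 0)

    # IPv4 header (20 B): build with checksum 0, then patch the checksum in.
    total_len = 20 + len(tcp) + len(payload)
    ip = struct.pack("!BBHHHBBH4s4s", 0x45, 0, total_len, 0, 0x4000,
                     64, 6, 0, src_ip, dst_ip)
    ip_csum = ~ones_complement_sum(ip) & 0xFFFF
    ip = ip[:10] + struct.pack("!H", ip_csum) + ip[12:]

    # Ethernet header (14 B): dst MAC, src MAC, EtherType 0x0800 (IPv4).
    eth = dst_mac + src_mac + struct.pack("!H", 0x0800)

    # Pad to the 60-byte minimum, then append the 4-byte FCS.
    body = (eth + ip + tcp + payload).ljust(60, b"\x00")
    fcs = struct.pack("<I", zlib.crc32(body))
    return {"eth": eth, "ip": ip, "tcp": tcp, "frame": body + fcs}
```

Calling `build_frame(b"hello")` gives header lengths of 14, 20 and 20 bytes, and re-summing the finished IPv4 header yields 0xFFFF, which is the check a receiver performs.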

I haven't implemented a TCP stack, but I can reason about where the NIC touches the packet, which is the part that matters here.

TCP encapsulation: each layer prepends its header (TCP 20 B, IP 20 B, Ethernet 14 B) plus a trailing FCS, then it is serialised onto the wire.

Watch the trip: each layer prepends its header.

[Animation, step 1 of 6. Stages: Application → TCP → IP → Link (driver) → NIC / DMA → Wire]

Step 1: send(fd, buf, len) hands your bytes to the kernel. At this point it is just the payload, in a socket buffer.

  • Ethernet header + FCS: added by the link layer / NIC
  • IP header: routing, TTL, header checksum
  • TCP header: ports, seq/ack, window, checksum (often NIC-offloaded)
  • your payload: never copied again under zero-copy / kernel bypass
What they're listening for: They're testing whether 'TCP/IP' on your CV is real. Win by being precise about the frame layout and what the NIC offloads, and by owning the foundations-not-production line before they force it.
Likely follow-ups
  • Which checksum does the NIC compute: the L2 FCS or the TCP checksum?
  • What are TSO / GSO / GRO and why do they matter for throughput?
You said the NIC offloads the TCP checksum. The TCP checksum covers a pseudo-header, so what does the host do and what does the NIC do?

This is the part people skip, and it's a fair test of whether I understand offload rather than just the word. The TCP checksum is a one's-complement sum over a pseudo-header (source and dest IP, protocol, TCP length) plus the TCP header and payload. The NIC doesn't want to walk the whole payload twice or understand IP addressing, so the work is split.

The host computes the cheap, payload-independent part: the pseudo-header sum. In Linux that's CHECKSUM_PARTIAL: the stack puts the pseudo-header's one's-complement sum into the checksum field and tells the NIC, via csum_start and csum_offset, where to start summing and where the checksum field sits relative to that start. The NIC then does the expensive part as it streams the bytes out: it sums from csum_start to the end of the packet, which includes the seeded field, and writes the complement of that total back into the field.

Why it composes: one's-complement addition is commutative and associative, so it doesn't matter who adds which words. Seeding the field with the pseudo-header sum means the NIC's single pass over the TCP header and payload picks the seed up along the way, and complementing the total gives a correct checksum without the NIC ever parsing IP. I've studied this rather than shipped it, but the elegance is the same one's-complement identity that makes 'a valid IP header sums to 0xFFFF' work: a value plus its one's complement is all-ones.
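A small model of that split, as a sketch rather than driver code. The function names (`host_seed`, `nic_finish`, `software_checksum`) are hypothetical, the 20 zero bytes merely stand in for an IP header so that `csum_start` is non-zero, and the addresses are illustrative. It checks that "host seeds, NIC sums and complements" gives the same value as computing the whole checksum in one place.

```python
import struct

CSUM_OFFSET = 16  # offset of the checksum field inside the TCP header


def ones_complement_sum(data: bytes) -> int:
    """16-bit one's-complement sum with end-around carry."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(word for (word,) in struct.iter_unpack("!H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def pseudo_header(src_ip: bytes, dst_ip: bytes, tcp_len: int) -> bytes:
    # src IP, dst IP, zero byte, protocol 6 (TCP), TCP length.
    return src_ip + dst_ip + struct.pack("!BBH", 0, 6, tcp_len)


def host_seed(packet: bytearray, csum_start: int, src_ip: bytes, dst_ip: bytes) -> None:
    """Host side: store the un-complemented pseudo-header sum in the field."""
    seed = ones_complement_sum(pseudo_header(src_ip, dst_ip, len(packet) - csum_start))
    struct.pack_into("!H", packet, csum_start + CSUM_OFFSET, seed)


def nic_finish(packet: bytearray, csum_start: int, csum_offset: int) -> None:
    """NIC side: sum csum_start..end (seed included), store the complement."""
    total = ones_complement_sum(bytes(packet[csum_start:]))
    struct.pack_into("!H", packet, csum_start + csum_offset, ~total & 0xFFFF)


def software_checksum(packet: bytes, csum_start: int, src_ip: bytes, dst_ip: bytes) -> int:
    """Reference: everything summed in one place, with the field zeroed first."""
    seg = bytearray(packet[csum_start:])
    seg[CSUM_OFFSET:CSUM_OFFSET + 2] = b"\x00\x00"
    total = ones_complement_sum(pseudo_header(src_ip, dst_ip, len(seg)) + bytes(seg))
    return ~total & 0xFFFF


def demo():
    src_ip, dst_ip = bytes([192, 0, 2, 1]), bytes([192, 0, 2, 2])
    tcp = struct.pack("!HHIIBBHHH", 40000, 80, 1, 0, 5 << 4, 0x18, 65535, 0, 0)
    packet = bytearray(bytes(20) + tcp + b"hello, wire")
    expected = software_checksum(bytes(packet), 20, src_ip, dst_ip)
    host_seed(packet, 20, src_ip, dst_ip)
    nic_finish(packet, 20, CSUM_OFFSET)
    (offloaded,) = struct.unpack_from("!H", packet, 20 + CSUM_OFFSET)
    return offloaded, expected, bytes(packet[20:]), src_ip, dst_ip
```

Note that `nic_finish` only needs two numbers, `csum_start` and `csum_offset`; it never looks at the IP addresses, which is the point of the split.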

What they're listening for: A precision trap on a claim you made. They want: host does the pseudo-header sum (CHECKSUM_PARTIAL, csum_start/offset), NIC sums the rest as it transmits. Knowing the split, not just 'the NIC does the checksum', is the signal.
Likely follow-ups
  • Why can the NIC compute its part without understanding IP?
  • What is CHECKSUM_PARTIAL versus CHECKSUM_UNNECESSARY on RX?
  • How does this relate to the one's-complement identity in the IPv4 checksum?
Why is the kernel slow enough that we bypass it? Name the actual costs.

Kernel bypass removes a specific list of per-packet costs; it's not 'the kernel is bad':

  • the send/recv syscall: a user→kernel mode switch every operation
  • a context switch / wakeup when a blocked thread is scheduled
  • `sk_buff` allocation and per-packet bookkeeping
  • a copy between kernel and user memory
  • interrupt overhead instead of busy-polling

Bypass (Onload/ef_vi) lets the app talk to the NIC rings directly in userspace and poll them, so on the hot path none of those are paid. The trade: you give up the kernel's isolation, generality and protocol surface and take on more responsibility, which is worth it only on latency-critical paths.

What they're listening for: List the concrete costs (syscall, context switch, sk_buff, copy, interrupt). That's what separates 'read the website' from 'understands the product'. Never say 'the kernel is bad'.
Likely follow-ups
  • Where does busy-polling help and what does it cost?
  • Why isn't bypass the right default for everything?
Busy-polling burns a whole core. Quantify what bypass actually buys you, and what it costs.

I'd give the shape with honest orders of magnitude rather than fake precision. The kernel socket path for a small UDP/TCP message costs on the order of a few microseconds each way: syscall, stack traversal, copy, and the wakeup latency if the receiver was blocked, which is the killer for tail latency because the scheduler decides when you run. A kernel-bypass receive that's already busy-polling its ring sees the packet in hundreds of nanoseconds, often well under a microsecond, because there's no syscall, no copy, no wakeup, and no interrupt-to-thread handoff.

The costs I'd name, so I don't sound like a brochure:

  • A core is pinned at 100% spinning on the ring whether or not traffic arrives: power and an entire core gone.
  • No blocking/`epoll` efficiency: bypass trades CPU for latency, which is only sane when latency is the product.
  • You own more: framing, retransmit logic if you're below TCP, and bugs that the kernel would have contained.

So the real win isn't just median latency, it's determinism: cutting the scheduler and interrupts out of the path collapses the long tail. That's exactly the trading and AI-cluster argument: the p99.9 matters more than the mean. I've reasoned about these numbers from the mechanism, not measured them on Solarflare hardware myself.

What they're listening for: A skeptic wants numbers, not adjectives. Microseconds (kernel) vs sub-microsecond (bypass), and the honest cost: a pinned core and lost generality. The 'determinism / tail latency' framing is the senior signal.
Likely follow-ups
  • Why does cutting the scheduler out help the tail more than the mean?
  • When would you still choose the kernel path despite the latency?
  • What is the power cost of busy-polling and how do teams mitigate it?
What's the difference between Onload and ef_vi, and when would a customer pick each?

Both are the Solarflare userspace datapath, at different levels:

  • Onload is a full userspace TCP/IP stack behind a sockets-compatible shim (via LD_PRELOAD). An unmodified BSD-sockets app gets accelerated transparently: same send/recv, just kernel-bypassed.
  • ef_vi is the layer-2 API underneath: raw, direct access to the NIC's RX/TX descriptor rings and event queue. You write your own framing and get the absolute lowest latency.

So a customer picks Onload to accelerate without rewriting the app, and ef_vi when they'll write to the metal for every last nanosecond, e.g. a trading stack doing its own packet handling.

What they're listening for: The single highest-leverage fact for this exact team. Crisp Onload-vs-ef_vi signals real prep; vagueness reads as name-dropping.
Likely follow-ups
  • If you bypass the kernel's TCP stack, who implements TCP?
  • Which would you reach for first, and why?
Where does TCPDirect sit between Onload and ef_vi?

TCPDirect is the in-between point, and knowing it exists shows I looked past the two headline names. Onload gives you the full POSIX sockets surface kernel-bypassed; ef_vi gives you raw layer-2 frames and you bring your own everything. TCPDirect is a zero-copy, ultra-low-latency TCP and UDP stack with its own dedicated API, so you still get TCP/UDP semantics, but not through the sockets compatibility layer, and it's tuned to shave latency further than Onload by trading away that generality and some POSIX completeness.

The way I'd rank them by latency-vs-effort: Onload (lowest effort, accelerate an existing sockets app), TCPDirect (rewrite to a zero-copy API, still get TCP, lower latency), ef_vi (raw L2, you implement framing, lowest latency of all). A customer moves down that ladder only as far as their latency budget forces them, because each step costs engineering and gives up generality.

I've studied the product line rather than built on it, but the gradient (sockets compatibility traded for nanoseconds) is the consistent theme across all three.

What they're listening for: A 'did you really read past the website' probe. The answer is the ladder: Onload (sockets shim) → TCPDirect (zero-copy TCP via its own API) → ef_vi (raw L2). Placing TCPDirect correctly is the tell.
Likely follow-ups
  • Why would a TCP user pick TCPDirect over Onload?
  • What does 'zero-copy' cost you in API ergonomics?
  • Which of the three would never be the right answer for a legacy app?
A doorbell is a posted MMIO write. If you ring it before the descriptor is visible to the device, what breaks, and what goes between them?

If the doorbell overtakes the descriptor write, the NIC can read a stale or half-written descriptor and DMA the wrong buffer or garbage.

The descriptors usually live in coherent DMA memory, so the barrier I want before the doorbell is dma_wmb(): it makes the descriptor stores visible to the device, and it's lighter than a full wmb() because it only orders accesses to coherent memory, not all of system memory. One subtlety I'd call out: dma_wmb() orders the descriptor writes; the doorbell itself is an MMIO write, and on Linux the writel() accessor carries its own ordering against prior memory writes, so the canonical sequence is *write descriptor → dma_wmb() → writel() doorbell*. The point is the publish ordering, not memorising one macro.

The completion side mirrors it: the device flips an owner/done bit to hand the buffer back, and on a weakly-ordered CPU you need an acquire-style barrier (`dma_rmb()`) after reading the owner bit before reading the rest of the descriptor, or you see the flag set but stale length/status fields.

I've reasoned about and traced this, not shipped it, but it's the exact class of MCU-DSP ownership bug I debugged at MediaTek.

What they're listening for: The deepest trap: you used doorbell/completion/owner-bit fluently, so they test the mechanism. Earn it: `dma_wmb()` orders the descriptor (coherent memory), `writel` orders the MMIO doorbell, `dma_rmb()` is the owner-bit acquire. Then tie it to your real MCU-DSP bug.
Likely follow-ups
  • Why `dma_wmb()` and not a full `wmb()` for the descriptor?
  • Owner bit in the descriptor vs a separate completion queue: what's the difference?
  • How is this the same as release/acquire in a lock-free ring?
You rang the doorbell. How do you know the device actually saw it? MMIO writes are posted.

Right: a posted PCIe memory write gets no completion, so the CPU's writel() retiring only means the write left the CPU, not that the device consumed it. Normally that's fine and desirable: you don't want to stall the CPU waiting on the device for every doorbell; the device will get it and the ordering rules guarantee it lands after the descriptor.

When you genuinely must know it arrived (typically at teardown, or before reading device state that depends on the write), the idiom is a read-back of a device register (a non-posted read). PCIe ordering says a read to the same device cannot pass the earlier posted writes, so when the read returns, the write has reached the device. So it's *write → read-back to flush*.

The caveat I'd add so I don't sound like I'd sprinkle it everywhere: a read-back is expensive (a full round-trip to the device, hundreds of nanoseconds), so you don't put one after every doorbell on the hot path; you rely on posted-write ordering there and only force the flush when correctness needs it, like during shutdown or a reset sequence. This is the PCIe analogue of the same 'when is it really visible' question I dealt with between MCU and DSP.

What they're listening for: Tests whether you understand posted vs non-posted PCIe. Answer: posted writes have no ack; a non-posted read-back flushes them; but it's costly, so only when correctness demands it, never per-doorbell on the hot path.
Likely follow-ups
  • Why does a read to the same device flush prior posted writes?
  • Why would a read-back after every doorbell be a performance bug?
  • When in a driver's lifecycle do you actually need the flush?
Sketch a NIC RX path, wire to application.

High level:

  • The driver pre-posts empty buffers on the RX ring and DMA-maps them.
  • A frame lands; the NIC DMAs it into the next posted buffer, writes length/status, sets the owner/done bit, and raises an interrupt.
  • The ISR does almost nothing: it schedules NAPI and masks further RX interrupts.
  • NAPI poll drains the ring in a batch (the key idea: amortising interrupt cost over many packets), builds an sk_buff per packet, pushes them up via napi_gro_receive/netif_receive_skb, and refills the ring.
  • The stack does IP/TCP; data is copied to userspace on recv.
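The interrupt-then-poll hand-off in the steps above can be modelled in a few lines. This is a toy, single-threaded sketch: `RxQueue`, `frame_arrives` and `poll` are hypothetical names, the ring is just a deque of already-DMA'd frames, and the budget rule (return to interrupt mode only when a poll drains fewer packets than its budget) is the behaviour assumed here.

```python
from collections import deque


class RxQueue:
    """Toy model of one RX queue: interrupt once, then poll in batches."""

    def __init__(self):
        self.ring = deque()       # frames the NIC has already DMA'd and marked done
        self.irq_enabled = True
        self.interrupts = 0       # how many interrupts were actually raised
        self.scheduled = False    # whether the poll routine is scheduled

    def frame_arrives(self, frame):
        self.ring.append(frame)
        if self.irq_enabled:
            # The ISR does almost nothing: mask RX interrupts, schedule the poll.
            self.interrupts += 1
            self.irq_enabled = False
            self.scheduled = True

    def poll(self, budget: int):
        """Drain up to `budget` frames; re-enable interrupts if the ring ran dry."""
        batch = []
        while self.ring and len(batch) < budget:
            batch.append(self.ring.popleft())
        if len(batch) < budget:
            self.scheduled = False
            self.irq_enabled = True
        return batch
```

With ten frames arriving back to back, this model takes one interrupt and three polls with a budget of 4, which is the amortisation the list describes.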

The bypass version cuts the corner: the NIC steers the flow to a userspace VI queue, the app busy-polls the ring directly, with no interrupt, no sk_buff and no copy.

That's the model as I've studied it, not shipped, but I can trace it.

What they're listening for: They'll ask this live to test whether 'I'm studying the datapath' is real. Narrate it cold: DMA into posted ring → owner bit → IRQ → NAPI batch → sk_buff → up; bypass = userspace poll, no IRQ/copy.
Likely follow-ups
  • Where does NAPI switch from interrupt to poll, and why?
  • What's different on the TX path?
Walk the TX completion path. After the app sends, how does the buffer get freed, and what's the trap?

TX is the mirror of RX and the trap is buffer lifetime. The app's data goes into a buffer, the driver writes a TX descriptor pointing at it (DMA address, length, flags like 'request completion', offload bits), then issues dma_wmb() and rings the TX doorbell. Crucially, at that moment the NIC has not sent the bytes yet; it will DMA them out of host memory asynchronously.

So the trap is: you must not free or reuse that buffer until the device signals it's done with it. The NIC writes a TX completion (flipping an owner/done bit or posting to a completion queue), and only then does the driver reclaim the buffer (in Linux, unmapping the DMA mapping and calling dev_kfree_skb). Freeing on doorbell-return instead of on completion is a classic use-after-free that DMAs stale or recycled memory onto the wire, and it's intermittent because it depends on timing.

Two more details I'd mention. Completions are usually batched and coalesced: you don't get an interrupt per packet, you reclaim a run of descriptors when you next poll, same amortisation idea as NAPI on RX. And there's a back-pressure dimension: if completions lag, the TX ring fills, and the driver has to stop the queue (netif_stop_queue) and wake it when space frees up, rather than overrun the ring. The ownership rule is identical to the MCU-DSP boundary: producer can't reclaim until the consumer has provably finished.
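A toy model of that lifetime rule, offered as a sketch under stated assumptions rather than driver code: `TxRing` and its methods are hypothetical, the "device" is simulated by calling `device_complete`, and `stopped` stands in for the stop/wake queue state. The one property it demonstrates is that buffers land in `freed` only after a completion, never at doorbell time.

```python
class TxRing:
    """Toy TX ring: buffers are reclaimed on completion, never on doorbell."""

    def __init__(self, size: int):
        self.slots = [None] * size
        self.head = 0        # next slot the driver fills
        self.dev = 0         # next slot the simulated device will complete
        self.tail = 0        # oldest slot not yet reclaimed
        self.in_flight = 0   # posted and not yet reclaimed
        self.pending = 0     # posted and not yet completed by the device
        self.stopped = False
        self.freed = []      # buffers the driver has released, in order

    def xmit(self, buf) -> bool:
        if self.in_flight == len(self.slots):
            self.stopped = True          # ring full: refuse and stop the queue
            return False
        self.slots[self.head] = {"buf": buf, "done": False}
        self.head = (self.head + 1) % len(self.slots)
        self.in_flight += 1
        self.pending += 1
        # A real driver would ring the doorbell here. The buffer is NOT freed:
        # the device still has to DMA it out of host memory.
        if self.in_flight == len(self.slots):
            self.stopped = True
        return True

    def device_complete(self, count: int) -> None:
        """Simulated device: mark the next `count` posted descriptors done."""
        for _ in range(min(count, self.pending)):
            self.slots[self.dev]["done"] = True
            self.dev = (self.dev + 1) % len(self.slots)
            self.pending -= 1

    def reclaim(self) -> int:
        """Driver poll: free a run of completed descriptors, then wake the queue."""
        reclaimed = 0
        while self.in_flight and self.slots[self.tail]["done"]:
            self.freed.append(self.slots[self.tail]["buf"])
            self.slots[self.tail] = None
            self.tail = (self.tail + 1) % len(self.slots)
            self.in_flight -= 1
            reclaimed += 1
        if reclaimed and self.stopped:
            self.stopped = False
        return reclaimed
```

On a two-slot ring, a third `xmit` is refused until `device_complete` and `reclaim` have run, and calling `reclaim` before any completion frees nothing.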

What they're listening for: RX is the rehearsed one; TX completion catches people. The key insight: free the buffer on *completion*, not on doorbell-return; anything else is a DMA use-after-free. Bonus for completion coalescing and queue stop/wake back-pressure.
Likely follow-ups
  • What goes wrong if you free the buffer right after ringing the doorbell?
  • Why are TX completions coalesced, and what's the latency trade?
  • How does the driver apply back-pressure when the TX ring is full?
Your DSP firmware is hard-real-time. Our datapath is throughput-bound at line rate. Where does your real-time intuition mislead you?

Fair distinction, and I'd name it myself. My real-time background is periodic and deadline-bound: a slot every so often, known workload, optimise for determinism and worst case. A NIC fast path is event-driven, bursty and throughput-bound: at line rate a small packet can arrive every few nanoseconds, and you optimise amortised per-packet cost under adversarial bursts.

What transfers: bounded work, no surprise allocations on the hot path, cache/data-layout awareness, ownership discipline. What I'd unlearn is optimising every event for worst-case latency when the right model is batching to amortise cost. NAPI is exactly that, the opposite of per-event determinism.

What they're listening for: A skeptic distrusts the glib 'same mindset'. Pre-empt it: name throughput-vs-determinism and batching yourself. Knowing what's *different* beats insisting it's all the same.
Likely follow-ups
  • Give an example of a cache/data-layout decision on a hot path.
  • Why does batching help throughput but hurt worst-case latency?
Do the line-rate arithmetic. At 100GbE with 64-byte frames, how much time do you get per packet, and what does that rule out?

Let me actually do it rather than wave at 'it's fast'. On Ethernet a 64-byte frame isn't 64 bytes on the wire: you add the preamble + start-of-frame (8 bytes) and the inter-packet gap (12 bytes), so the minimum is 84 bytes = 672 bits per packet. At 100 Gbit/s that's about 148.8 million packets per second, which is roughly 6.7 nanoseconds per packet.

Six-and-a-bit nanoseconds is brutal once you anchor it: a last-level-cache miss to DRAM is ~100 ns, so a single cache miss is ~15 packets' worth of budget. That arithmetic rules things out immediately: you cannot take a syscall, an interrupt, an sk_buff allocation, or a lock per packet at that rate. It forces batching (handle many packets per poll so fixed costs amortise), prefetching descriptors and headers ahead of use, cache-line-friendly layouts, and steering flows across multiple queues/cores (RSS) because one core physically can't keep up.
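The arithmetic is short enough to write down and rerun for other link speeds or frame sizes. A minimal sketch; `per_packet_budget` is a hypothetical helper, and the ~100 ns DRAM miss is the same rough figure used in the text, not a measurement.

```python
PREAMBLE_SFD_BYTES = 8   # preamble + start-of-frame delimiter
IPG_BYTES = 12           # minimum inter-packet gap


def per_packet_budget(link_bps: float, frame_bytes: int = 64):
    """Return (bits on the wire, packets per second, nanoseconds per packet)."""
    wire_bits = (frame_bytes + PREAMBLE_SFD_BYTES + IPG_BYTES) * 8
    pps = link_bps / wire_bits
    ns_per_packet = 1e9 / pps
    return wire_bits, pps, ns_per_packet


bits, pps, ns = per_packet_budget(100e9)
print(f"{bits} bits on the wire, {pps / 1e6:.1f} Mpps, {ns:.2f} ns per packet")
print(f"one ~100 ns DRAM miss costs about {100 / ns:.0f} packets of budget")
```

Changing the first argument to `10e9` or `400e9` shows how the budget scales: the per-packet time moves linearly with link speed while the cost of a cache miss does not.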

This is exactly the bypass argument made quantitative: the per-packet overheads I listed for the kernel path simply don't fit in 6.7 ns. I've reasoned this from the numbers rather than tuned a 100G NIC myself, but the budget is the thing that makes the whole datapath design make sense.

What they're listening for: The classic 'is your low-latency instinct real' test. They want the 148.8 Mpps / ~6.7 ns figure (with the 84-byte minimum), and the punchline that a ~100 ns cache miss dwarfs the budget, so no per-packet syscall/alloc/lock, hence batching + prefetch + RSS.
Likely follow-ups
  • Why is the on-wire minimum 84 bytes, not 64?
  • How many packets' budget does one DRAM miss cost?
  • How does this justify multi-queue / RSS?
Stop telling me what UL-DAI is. Tell me the hardest bug you personally root-caused.

*(Template: fill with your real UL-DAI / missing-DL-DCI / MCU-DSP war story. The structure that lands:)*

  • Symptom: a specific observable, such as a KPI/counter that moved, or a mis-decode in a specific scenario.
  • Wrong first hypothesis: what you chased first and why it was reasonable but wrong.
  • Root cause: the actual mechanism, e.g. a state/timing/ownership issue between MCU and DSP, or a DAI counter edge case at wrap.
  • Proof: from traces/core dumps, lining up expected vs actual sequence; a before/after number if you have one.
  • The fix + the regression you added.

Keep it in the firmware domain; resist 'the NIC equivalent' until the end. Land the concrete story first, then one sentence of bridge.

What they're listening for: Your strongest asset, but only as a concrete lived story, not process narration. The trap is escaping to analogy before landing symptom + wrong-turn + root-cause + proof. Have ONE rehearsed cold.
Likely follow-ups
  • What edge case came back from integration?
  • What's the NIC-datapath equivalent of that bug?
How does a userspace process safely tell the NIC to DMA into its memory without the kernel mediating every packet?

The setup stays privileged even though the datapath doesn't. Ahead of time the driver/kernel creates the queues and registers/pins the memory regions the NIC may DMA into for that app, and programs the NIC and usually the IOMMU so the device can only touch those regions. After that the app posts buffers and rings doorbells directly in userspace, but it can only reach its own pre-registered memory, because the IOMMU/queue setup enforces it.

So safety comes from registration + IOMMU translation/protection at setup, not from the kernel checking every packet. That's what makes bypass both fast and safe.

What they're listening for: Tests whether you get that bypass is fast *and* safe. Answer: memory registration + IOMMU enforcement at setup; the per-packet path is unprivileged but bounded to registered memory.
Likely follow-ups
  • What role does the IOMMU / IOVA play here?
  • What stops one app's NIC queue from reading another's buffers?
Spell out the IOMMU's job. What exactly does it translate, and what attack does it stop?

The IOMMU is to a device what the MMU is to a CPU. A device issues DMA using an I/O virtual address (IOVA), and the IOMMU translates that to a physical address through per-device page tables before the memory controller sees it. Two consequences matter.

Protection: a device can only reach physical pages that have actually been mapped into its IOVA space. Without an IOMMU, a DMA-capable device can read or write any physical memory, and that's the foundation of DMA attacks (a malicious or buggy device, or in bypass a misbehaving app's queue, scribbling over kernel memory or another process). With the IOMMU, an unmapped or out-of-bounds DMA faults instead of corrupting memory. That's precisely what makes userspace bypass safe: the app drives its queue directly, but its descriptors can only name IOVAs that map to its own pinned buffers.

Decoupling: because the device sees IOVAs, the buffers don't need to be physically contiguous. The IOMMU gathers scattered physical pages into a contiguous-looking IOVA range, which is why scatter-gather DMA works cleanly.

So to the follow-up 'what stops one app's queue reading another's buffers': the answer is that the IOMMU page tables for that queue simply don't map the other app's pages, so the DMA faults. I've studied this; the analogy I lean on is virtual memory, which I do understand from the CPU side.
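Both consequences, protection and decoupling, show up in even a very small model. This is a sketch under simplifying assumptions: a flat dictionary per device instead of real multi-level page tables, 4 KiB pages, and hypothetical names (`Iommu`, `IommuFault`, the device ids `"q0"` and `"q1"`).

```python
PAGE = 4096  # assumed page size for this sketch


class IommuFault(Exception):
    """Raised when a device issues DMA to an IOVA it has no mapping for."""


class Iommu:
    def __init__(self):
        # device id -> {IOVA page number: physical page number}
        self.tables = {}

    def map(self, device: str, iova: int, phys_pages: list) -> None:
        """Map scattered physical pages to one contiguous IOVA range."""
        if iova % PAGE:
            raise ValueError("IOVA must be page-aligned")
        table = self.tables.setdefault(device, {})
        for i, phys_page in enumerate(phys_pages):
            table[iova // PAGE + i] = phys_page

    def translate(self, device: str, iova: int) -> int:
        """IOVA -> physical address, or fault if this device has no mapping."""
        table = self.tables.get(device, {})
        page = iova // PAGE
        if page not in table:
            raise IommuFault(f"{device}: unmapped DMA to IOVA {iova:#x}")
        return table[page] * PAGE + iova % PAGE


iommu = Iommu()
# Physical pages 7 and 3 are not adjacent, but the device sees 0x10000..0x11fff.
iommu.map("q0", 0x10000, [7, 3])
```

Here `iommu.translate("q0", 0x10000)` resolves into physical page 7, while the same IOVA issued by `"q1"`, or an IOVA past the mapped range, raises `IommuFault` instead of touching memory.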

What they're listening for: The mechanism behind the 'bypass is safe' claim. They want IOVA→physical translation via per-device page tables, that unmapped DMA faults (stops DMA attacks / cross-app reads), and the scatter-gather decoupling bonus.
Likely follow-ups
  • How is this analogous to a CPU MMU and virtual memory?
  • What's the cost of IOMMU translation on the DMA hot path?
  • How does scatter-gather DMA benefit from IOVA?
RSS spreads flows across queues and cores. How does the NIC decide which queue, and what's the subtle correctness issue?

RSS (receive-side scaling) lets the NIC fan incoming packets across multiple RX queues so different cores process different flows in parallel, which is mandatory at line rate because one core can't keep up. The mechanism: the NIC computes a hash (typically Toeplitz) over a flow tuple, usually the 5-tuple of src/dst IP, src/dst port and protocol, takes some low bits of the hash as an index into a small indirection table, and that entry names the target queue. So all packets of one flow hash the same way and land on the same queue and core.
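A sketch of the hash-then-indirect step, with several assumptions worth naming: the 40-byte key is just an example value (any 40 bytes would do), the table has 128 entries spread round-robin over 8 queues, and only addresses and ports are fed to the hash, with the protocol assumed to select which fields are hashed rather than being hashed itself. `pick_queue` is a hypothetical helper.

```python
import socket
import struct

# Example 40-byte hash key; a real NIC is programmed with its own key.
RSS_KEY = bytes.fromhex(
    "6d5a56da255b0ec24167253d43a38fb0d0ca2bcbae7b30b4"
    "77cb2da38030f20c6a42b73bbeac01fa"
)
NUM_QUEUES = 8
INDIRECTION_TABLE = [i % NUM_QUEUES for i in range(128)]


def toeplitz_hash(key: bytes, data: bytes) -> int:
    """For every set bit of the input, XOR in the 32-bit key window at that bit."""
    key_bits = len(key) * 8
    if len(data) * 8 > key_bits - 32 + 1:
        raise ValueError("input too long for this key")
    key_int = int.from_bytes(key, "big")
    result = 0
    for i in range(len(data) * 8):
        if data[i // 8] & (0x80 >> (i % 8)):
            result ^= (key_int >> (key_bits - 32 - i)) & 0xFFFFFFFF
    return result


def pick_queue(src_ip: str, dst_ip: str, src_port: int, dst_port: int) -> int:
    """Hash the flow tuple, then index the indirection table with the low bits."""
    flow = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
            + struct.pack("!HH", src_port, dst_port))
    flow_hash = toeplitz_hash(RSS_KEY, flow)
    return INDIRECTION_TABLE[flow_hash & (len(INDIRECTION_TABLE) - 1)]
```

Every packet of a given flow produces the same tuple, hence the same hash and the same queue; rebalancing load means rewriting `INDIRECTION_TABLE` entries, not changing the hash.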

The subtle correctness/performance issue is flow-to-core affinity and ordering. Because RSS is per-flow, a single TCP connection stays on one core, which preserves in-order delivery for that flow and keeps its state cache-hot. But if the hash only covered, say, IPs and not ports, all flows between two hosts would collapse onto one queue and you'd lose the spread, which is a real misconfiguration. And the flip side, flow steering / Flow Director / aRFS, exists to override the hash so a flow lands on the *specific* core where the consuming application thread runs, avoiding a cross-core handoff and cache bounce.

The through-line to low latency: you don't just want spread, you want the packet to arrive on the same core that will consume it, so RSS plus app-aware steering is about cache locality as much as parallelism. I've studied this rather than tuned it on hardware.

What they're listening for: Tests depth behind a buzzword. Want: Toeplitz hash over the 5-tuple → indirection table → queue; per-flow affinity preserves ordering and cache locality; and steering (Flow Director/aRFS) to land a flow on the consuming core. The 'same core that consumes it' insight is the senior tell.
Likely follow-ups
  • Why hash the full 5-tuple instead of just IPs?
  • How does keeping a flow on one core help TCP and the cache?
  • What does Flow Director / aRFS add on top of plain RSS?
Bypass skips the kernel TCP stack, so who retransmits a lost packet, handles congestion, and reassembles out-of-order segments?

Whoever provides the stack you chose. The function doesn't vanish, it just moves. This is the honest cost side of bypass.

  • With Onload or TCPDirect, there's a full userspace TCP implementation doing exactly what the kernel did: sequence numbers, ACKs, retransmit timers, fast retransmit, a congestion-control algorithm, receive-window flow control, and reassembling out-of-order segments into an in-order byte stream. You bypass the *kernel's* TCP, but you're still running *a* TCP, just in the app's address space, polling the NIC. So reliability and congestion control are intact.
  • With ef_vi you're at raw layer 2: frames, no TCP at all. So if the app needs reliability it implements it itself, or, far more commonly, it's running a protocol that doesn't need kernel TCP: UDP multicast market data, a custom request/response with its own sequencing, or RDMA-style transports where the NIC hardware handles reliability.

The senior point is the trade: dropping to ef_vi for the lowest latency means you've taken ownership of everything TCP gave you for free. That's a deliberate decision a trading or AI-cluster team makes only where the protocol is simple enough to own. It's not magic; it's a transfer of responsibility, the same way bypassing an RTOS service means you now own that timing yourself.

What they're listening for: Catches people who think bypass deletes TCP's hard problems. Answer: Onload/TCPDirect still run a full userspace TCP (retransmit/congestion/reassembly intact); ef_vi has none, so the app owns reliability or uses a protocol that doesn't need it. Frame it as transfer of responsibility.
Likely follow-ups
  • Why might a market-data feed happily use ef_vi with no TCP?
  • What does a userspace TCP have to re-implement that the kernel gave you?
  • How does RDMA change who owns reliability?
Why can the first packet after idle be much slower than packets once the path is warm?

I would answer this as a cache and locality problem first, not as a mystery NIC problem. A warm polling path has the loop instructions, ring state, descriptors, queue metadata, flow state and hot branches already in the CPU's caches and predictors. After idle, much of that state may be gone or cold, the core may have downclocked or entered a deeper idle state, and the first packet pays the wakeup and refill cost.

The mechanism is concrete:

  • Instruction cache and branch predictor: the poll loop, parser and completion path may have been evicted, and rare first-packet branches can mispredict.
  • Data cache: the RX/TX ring, descriptor cache lines, queue state, socket or flow state and buffer metadata may miss in L1/L2/LLC. A DRAM miss can be tens to around a hundred nanoseconds, which is already huge on a low-latency path.
  • Memory locality: if the queue, buffer pool or application thread is on the wrong NUMA node or CPU, the first touch may cross socket or chiplet boundaries.
  • Power state: if the core was asleep or running at a low frequency, the first packet includes wake and frequency-ramp effects that later packets do not.

What I would do in a latency-sensitive design is keep the hot poll loop genuinely hot: pin the poll thread, keep queue and application affinity aligned, allocate buffers on the NIC-local NUMA node, avoid allocation on the first packet, and keep the hot per-queue structure compact. I would separate hot fields from cold debug counters, avoid pointer chasing in the first cache line, and use per-queue/per-core state so one packet does not bounce shared cache lines across CPUs.

Prefetch can help when the next address is predictable. For example, while processing descriptor i, the loop can prefetch descriptor i + 2 and maybe the packet header, but only after ownership is valid and only if measurement shows it helps. Prefetch too early or too far ahead can pollute the cache and worsen the tail.

The interview bridge is honest: I have not tuned a production Solarflare or AMD NIC poll loop, but I have worked on TX DSP firmware where a small cold-path or data-layout mistake can violate a real-time budget. The transferable skill is to ask which bytes and instructions the CPU must touch on the first packet, keep that working set small and local, then prove the tail with measurements rather than folklore.

What they're listening for: A mechanism drill: cold cache, branch predictor, NUMA, power state, prefetch and struct layout. The trap is saying only 'cache miss' without explaining why the first packet after idle is slow or how to keep the poll loop hot.
Likely follow-ups
  • Which fields belong in the first cache line of per-queue state?
  • How can prefetch hurt p99?
  • Why can CPU power state look like a networking latency bug?
  • How would you measure cold versus warm packet latency?