🧭 Live Scenarios & Mini-Designs

Open-ended first-screen prompts for AMD networking: not discrete C puzzles, but small system designs, debugging walkthroughs, latency tradeoffs, and explanations that reveal how Mohamed thinks at the hardware/software boundary.

Design a fixed-capacity in-memory packet or event logger for a NIC datapath. What decisions do you make?

How I would approach it: I would first ask where this logger runs: normal thread, softirq/NAPI-like context, hard interrupt, or firmware/RTOS context. Then I would fix the constraints: no allocation on the hot path, bounded work, clear overflow policy, and records small enough not to destroy the thing we are measuring.

My spoken answer would be: "I would start with a fixed-size ring of records. Each record has a timestamp or cycle counter if available, an event id, queue id, packet length, maybe a small reason code, and one or two context fields. I would avoid copying payloads unless the debug mode explicitly asks for it. For normal operation the logger should be cheap and mostly disabled; when enabled it still must be bounded."

The key design choice is overflow. For post-mortem debug I usually prefer overwrite-oldest, because the most recent events around the failure are often what I need. For audit-style diagnostics I would prefer drop-newest and count drops, because losing old events changes the history. I would make that policy explicit rather than hiding it.
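
A minimal sketch of making that overflow policy explicit, assuming a single writer context (no synchronization is shown) and a caller-supplied timestamp. The `evlog_*` names and the record layout are illustrative, not taken from any real driver:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EVLOG_CAPACITY 8u /* power of two so the index mask works */
static_assert((EVLOG_CAPACITY & (EVLOG_CAPACITY - 1u)) == 0u,
              "capacity must be a power of two");

enum evlog_policy { EVLOG_OVERWRITE_OLDEST, EVLOG_DROP_NEWEST };

struct evlog_rec {
    uint64_t timestamp; /* cycle counter or clock value from the caller */
    uint16_t event_id;
    uint16_t queue_id;
    uint16_t pkt_len;
    uint16_t reason;
};

struct evlog {
    struct evlog_rec rec[EVLOG_CAPACITY];
    uint32_t head;        /* total records accepted (monotonic) */
    uint32_t tail;        /* position of the oldest retained record */
    uint32_t dropped;     /* new records rejected under DROP_NEWEST */
    uint32_t overwritten; /* old records lost under OVERWRITE_OLDEST */
    enum evlog_policy policy;
};

void evlog_init(struct evlog *log, enum evlog_policy policy)
{
    log->head = log->tail = 0;
    log->dropped = log->overwritten = 0;
    log->policy = policy;
}

/* Bounded work: one comparison, one fixed-size copy, no allocation. */
bool evlog_write(struct evlog *log, const struct evlog_rec *r)
{
    if (log->head - log->tail == EVLOG_CAPACITY) {
        if (log->policy == EVLOG_DROP_NEWEST) {
            log->dropped++;
            return false;
        }
        log->tail++; /* give up the oldest record, and count it */
        log->overwritten++;
    }
    log->rec[log->head & (EVLOG_CAPACITY - 1u)] = *r;
    log->head++;
    return true;
}

bool evlog_read(struct evlog *log, struct evlog_rec *out)
{
    if (log->head == log->tail)
        return false;
    *out = log->rec[log->tail & (EVLOG_CAPACITY - 1u)];
    log->tail++;
    return true;
}
```

Either policy loses data when the ring is full; the point of the two counters is that the loss is visible in the dump instead of silent.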

For synchronization, if this is single producer and single consumer, I would use an SPSC ring with separate producer and consumer indices. If a hard interrupt can log, I would not take a sleeping lock. I would either keep it per-CPU/per-queue so the producer is effectively single, or use a very small atomic reservation scheme. The consumer can drain later to debugfs, trace output, or a userspace dump.

The hardware/software bridge from my background is direct: I have debugged timing and ownership issues at an MCU-DSP boundary. I have not shipped a Linux NIC driver, so I would verify the exact kernel tracing primitives, but the invariants are familiar: fixed memory, bounded latency, correct ownership, and diagnostics that do not create a new timing bug.

What they're listening for: They want to see bounded design, overflow policy, synchronization context, and measurement humility. The trap is designing an unbounded printf-style logger in the packet path.
Likely follow-ups
  • Would you overwrite old records or drop new ones?
  • How would this change in hard interrupt context?
  • What fields are worth logging without making the logger too expensive?
  • How would you prove the logger is not causing the latency spike?

How would you move packets or packet events from an interrupt handler to a worker thread?

How I would approach it: I would separate the urgent interrupt work from the deferred work. The interrupt side should acknowledge or mask the device cause, capture the minimum state needed, publish work to a queue, and return quickly. The worker owns the expensive parsing, logging, recovery, or user-visible reporting.

My spoken answer would be: "I would use a bounded producer/consumer queue, ideally SPSC if one interrupt source maps to one worker. The interrupt handler produces descriptors or event pointers, not big copied packets. It writes the entry first, then publishes the updated producer index with the required ordering. The worker reads the producer index, drains entries, and advances the consumer index after it has taken ownership."

The important part is ownership. At any point a packet buffer or event record is owned by hardware, the interrupt side, the worker, or the free pool. I would be explicit about transitions so there is no double-free, stale read, or buffer reuse while the worker still inspects it.

For memory ordering, I would not rely on volatile. In kernel code I would use the appropriate barriers or queue primitives; conceptually the producer must make the record contents visible before the index says it is available, and the consumer must read the contents after observing that index. If the queue is full, the interrupt side must not spin forever. It either drops with a counter, coalesces events, or schedules a reset/slow-path signal depending on severity.
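
To make the publish/observe ordering concrete, here is a user-space sketch using C11 atomics as a stand-in. It is not kernel code: a driver would use the kernel's own barrier or queue primitives, and the `spsc_*` names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SPSC_SIZE 16u
static_assert((SPSC_SIZE & (SPSC_SIZE - 1u)) == 0u,
              "size must be a power of two");

struct spsc_ring {
    uint32_t slot[SPSC_SIZE];    /* descriptor ids or handles, not payloads */
    _Atomic uint32_t prod;       /* written only by the producer */
    _Atomic uint32_t cons;       /* written only by the consumer */
    _Atomic uint32_t full_drops; /* producer found the ring full */
};

void spsc_init(struct spsc_ring *r)
{
    atomic_init(&r->prod, 0);
    atomic_init(&r->cons, 0);
    atomic_init(&r->full_drops, 0);
}

/* Producer side (interrupt-like context): never spins, never blocks. */
bool spsc_push(struct spsc_ring *r, uint32_t v)
{
    uint32_t p = atomic_load_explicit(&r->prod, memory_order_relaxed);
    uint32_t c = atomic_load_explicit(&r->cons, memory_order_acquire);
    if (p - c == SPSC_SIZE) {
        atomic_fetch_add_explicit(&r->full_drops, 1, memory_order_relaxed);
        return false; /* caller drops, coalesces, or escalates */
    }
    r->slot[p & (SPSC_SIZE - 1u)] = v;
    /* Release: the slot contents become visible before the new index. */
    atomic_store_explicit(&r->prod, p + 1u, memory_order_release);
    return true;
}

/* Consumer side (worker): reads the slot only after observing the index. */
bool spsc_pop(struct spsc_ring *r, uint32_t *out)
{
    uint32_t c = atomic_load_explicit(&r->cons, memory_order_relaxed);
    uint32_t p = atomic_load_explicit(&r->prod, memory_order_acquire);
    if (p == c)
        return false;
    *out = r->slot[c & (SPSC_SIZE - 1u)];
    /* Release: the slot is handed back only after it has been read. */
    atomic_store_explicit(&r->cons, c + 1u, memory_order_release);
    return true;
}
```

The two release stores are the ownership transitions: advancing `prod` gives an entry to the worker, and advancing `cons` gives the slot back to the producer.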

That maps well to my embedded work because the MCU-DSP bug pattern is the same: one side publishes state, another side consumes it, and most failures come from unclear ownership or ordering assumptions. The Linux names are different, but the producer/consumer reasoning is the same.

What they're listening for: A strong answer names deferral, SPSC rings, ownership, bounded failure behavior, and memory ordering. The trap is doing heavy work or blocking in the interrupt handler.
Likely follow-ups
  • What happens if the queue is full?
  • Where do you need memory barriers?
  • Would you copy packets or pass ownership of buffers?
  • How would per-queue workers help latency?

A customer reports dropped packets under load. How do you debug it?

How I would approach it: I would avoid starting with a pet theory. I would locate the drop point first, then narrow by correlation: traffic pattern, queue, packet size, CPU, interrupt behavior, firmware state, and configuration.

My spoken answer would be: "First I would define the symptom precisely. Is it RX drops, TX drops, application misses, retransmits, or a counter that increments somewhere? Does it happen only at line rate, only for small packets, after reset, on one queue, with one MTU, or with a certain offload enabled? Then I would build a datapath map and put counters at each boundary."

For RX I would look at hardware MAC/PHY counters, NIC queue/ring counters, descriptor exhaustion, DMA mapping or allocation failures, interrupt moderation, NAPI budget pressure, softnet drops, socket buffer drops, and application receive behavior. For TX I would look at queue stop/wake behavior, completion handling, descriptor leaks, DMA errors, and whether the device is actually transmitting.

I would also check the boring things early: link errors, MTU mismatch, filters, RSS/queue configuration, CPU affinity, NUMA placement, firmware/driver version, and whether the repro is a burst problem rather than average bandwidth. At 100GbE, small packets create a completely different pressure profile from large packets.

The honest bridge is that my shipped experience is wireless PHY/embedded firmware, not Linux NIC production support. But I have triaged 100+ issues and worked bugs where the symptom crossed the hardware/software boundary. My method would be the same: make the path observable, split by ownership boundary, reduce to a reproducible case, and only then change code.

What they're listening for: They are checking debugging structure, not whether you know every Linux counter by heart. The trap is jumping straight to 'increase the ring size' or blaming the customer workload.
Likely follow-ups
  • Which counters would you inspect first?
  • How do you tell hardware drops from kernel or application drops?
  • What if drops only happen for 64-byte packets?
  • What information would you ask the customer for?

Estimate the per-packet time budget at 100GbE for minimum-size packets. What does that rule out?

How I would approach it: I would do a quick Fermi estimate and say exactly what assumptions I am using. Ethernet minimum frames are 64 bytes, but on the wire each frame also pays 8 bytes of preamble and start delimiter plus a 12-byte inter-frame gap, so a minimum frame occupies 84 bytes, or 672 bit times. At 100 Gbit/s that is 100e9 / 672, about 148.8 million packets per second, which is the commonly quoted line-rate figure for minimum Ethernet frames at 100GbE.

My spoken answer would be: "So the budget is roughly 1 / 148.8 Mpps, which is about 6.7 ns per packet. At a 3 GHz CPU, one cycle is about 0.33 ns, so that is only about 20 CPU cycles per packet if a single core tried to touch every minimum packet. That is not a real software budget for rich processing. It tells me the design must batch, spread across queues and cores, use DMA and hardware offloads, and keep per-packet software work extremely small."
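
The same arithmetic as a small helper, so the assumptions are explicit and other frame sizes can be tried. The function names are illustrative; the 20-byte wire overhead is the standard 7-byte preamble, 1-byte start delimiter and 12-byte inter-frame gap.

```c
#include <assert.h>
#include <stdint.h>

/* Per-frame wire overhead: 7 preamble + 1 start delimiter + 12 inter-frame gap. */
#define ETH_WIRE_OVERHEAD_BYTES 20u

/* Packets per second at line rate for a given frame size (including FCS). */
double line_rate_pps(double link_bps, uint32_t frame_bytes)
{
    assert(link_bps > 0.0 && frame_bytes > 0u);
    return link_bps / (8.0 * (double)(frame_bytes + ETH_WIRE_OVERHEAD_BYTES));
}

/* Time between back-to-back packets, in nanoseconds. */
double ns_per_packet(double link_bps, uint32_t frame_bytes)
{
    return 1e9 / line_rate_pps(link_bps, frame_bytes);
}

/* Cycles one core would have per packet if it touched every packet. */
double cycles_per_packet(double link_bps, uint32_t frame_bytes, double cpu_hz)
{
    assert(cpu_hz > 0.0);
    return cpu_hz / line_rate_pps(link_bps, frame_bytes);
}
```

With `link_bps = 100e9` and `frame_bytes = 64` this gives about 148.8 Mpps, 6.72 ns and roughly 20 cycles at 3 GHz. A 1518-byte frame gives about 8.1 Mpps and about 123 ns per packet, which is why small packets are the stressful case.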

What it rules out is anything like allocation per packet, logging per packet, locks on a shared global structure, cache-missing through several pointer layers, syscalls per packet, or long branchy parsing on the hot path. Even one last-level cache miss can be many tens of nanoseconds, which is already multiple packets at this rate.

I would also be careful not to overinterpret the estimate. Real systems use multiple queues, batching, larger average packet sizes, interrupt moderation, polling, and hardware classification. The point is not that software can only spend 6.7 ns total; the point is that minimum-packet line rate forces architecture decisions. You cannot treat packets as ordinary high-level objects and expect 100GbE behavior.

This is a good place for my PHY background to transfer: radio firmware also lives under timing budgets where a tiny-looking operation can be disallowed by the schedule. The numbers change, but the habit of converting throughput into time and then into design constraints is the same.

What they're listening for: They want estimation and design consequences. The trap is quoting bandwidth without converting to packet rate and CPU cycles.
Likely follow-ups
  • What changes for 1500-byte packets?
  • Why is one cache miss already expensive here?
  • How do RSS and multiple queues change the estimate?
  • What work would you push into hardware?

Explain kernel bypass, DMA, and a descriptor ring to a non-expert.

How I would approach it: I would use a plain analogy but keep the mechanics accurate. I would avoid overselling kernel bypass as magic; it trades kernel services for lower overhead and more responsibility in the application or runtime.

My spoken answer would be: "Normally, network data passes through the operating system's networking stack before the application receives it. That is flexible and safe, but it adds overhead: system calls, copies or references through kernel structures, scheduling, and general-purpose protocol processing. Kernel bypass is a way for a trusted application to talk more directly to the NIC through a controlled interface, so packet send and receive can avoid much of that generic path."

"DMA means direct memory access. Instead of the CPU copying every byte from the NIC, the CPU sets up memory buffers and tells the device where they are. The NIC then reads or writes those buffers directly over PCIe. The CPU still controls ownership and correctness; it just does not move every byte itself."

"A descriptor ring is the shared checklist between software and the NIC. Software fills descriptors that say, roughly, 'here is a buffer address, here is its length, here are flags'. The device consumes those descriptors and later reports completions. On receive, software posts empty buffers and the NIC fills them with packets. Producer and consumer indices move around a circular array. Most bugs are ownership bugs: did software reuse a buffer before the device was done, did the device see a descriptor before it was fully written, did software read a packet before DMA was complete?"

I would add that AMD/Solarflare history makes this especially relevant because Onload and ef_vi are examples of low-latency user-level networking interfaces. I have not shipped with those APIs, but the descriptor/DMA ownership model is exactly the kind of hardware/software contract I am used to reasoning about.
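
The receive-side descriptor handoff described above can be modelled in plain C. This is a software model only: the "device" is an ordinary function, the doorbell is an atomic variable standing in for an MMIO register, and the descriptor layout and `DESC_F_DONE` flag are made up for illustration rather than taken from any real NIC or from Onload/ef_vi.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RXQ_SIZE 8u
#define RXQ_MASK (RXQ_SIZE - 1u)
static_assert((RXQ_SIZE & RXQ_MASK) == 0u, "size must be a power of two");

#define DESC_F_DONE 0x1u /* hypothetical: device finished writing the buffer */

struct rx_desc {
    uint64_t buf_addr;      /* DMA address of the posted buffer */
    uint16_t len;           /* buffer capacity on post, packet length on completion */
    _Atomic uint16_t flags;
};

struct rx_queue {
    struct rx_desc ring[RXQ_SIZE];
    uint32_t post_idx;         /* software: next slot to post */
    uint32_t clean_idx;        /* software: next slot expected to complete */
    uint32_t dev_idx;          /* model only: the device's private position */
    _Atomic uint32_t doorbell; /* stand-in for an MMIO tail register */
};

void rxq_init(struct rx_queue *q)
{
    for (uint32_t i = 0; i < RXQ_SIZE; i++) {
        q->ring[i].buf_addr = 0;
        q->ring[i].len = 0;
        atomic_init(&q->ring[i].flags, 0);
    }
    q->post_idx = q->clean_idx = q->dev_idx = 0;
    atomic_init(&q->doorbell, 0);
}

/* Software posts an empty buffer: fill the descriptor, then publish it. */
bool rxq_post(struct rx_queue *q, uint64_t buf_addr, uint16_t buf_len)
{
    if (q->post_idx - q->clean_idx == RXQ_SIZE)
        return false;
    struct rx_desc *d = &q->ring[q->post_idx & RXQ_MASK];
    d->buf_addr = buf_addr;
    d->len = buf_len;
    atomic_store_explicit(&d->flags, 0, memory_order_relaxed);
    q->post_idx++;
    /* Release: the descriptor is fully written before the device can see it. */
    atomic_store_explicit(&q->doorbell, q->post_idx, memory_order_release);
    return true;
}

/* Model of the device receiving one packet into the next posted buffer. */
bool model_device_rx(struct rx_queue *q, uint16_t pkt_len)
{
    uint32_t tail = atomic_load_explicit(&q->doorbell, memory_order_acquire);
    if (q->dev_idx == tail)
        return false; /* no posted buffer: the packet is dropped */
    struct rx_desc *d = &q->ring[q->dev_idx & RXQ_MASK];
    d->len = pkt_len;
    atomic_store_explicit(&d->flags, DESC_F_DONE, memory_order_release);
    q->dev_idx++;
    return true;
}

/* Software takes ownership back only after it observes DONE. */
bool rxq_poll(struct rx_queue *q, uint64_t *buf_addr, uint16_t *pkt_len)
{
    if (q->clean_idx == q->post_idx)
        return false;
    struct rx_desc *d = &q->ring[q->clean_idx & RXQ_MASK];
    if (!(atomic_load_explicit(&d->flags, memory_order_acquire) & DESC_F_DONE))
        return false;
    *buf_addr = d->buf_addr;
    *pkt_len = d->len;
    q->clean_idx++;
    return true;
}
```

Each of the three ownership questions in the explanation maps to one line here: the full check in `rxq_post`, the release store to the doorbell, and the acquire load of the flags in `rxq_poll`.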

What they're listening for: They are testing whether you can explain deep systems simply without becoming inaccurate. The trap is saying bypass means no kernel at all or DMA means no CPU involvement.
Likely follow-ups
  • What safety problems does kernel bypass create?
  • Who owns a receive buffer before and after DMA?
  • Why are memory barriers needed around descriptor rings?
  • When would you not use kernel bypass?

Design a flow or connection table for fast packet lookup.

How I would approach it: I would first define the key, update rate, lifetime, and concurrency model. A TCP/UDP five-tuple table for fast lookup has different constraints from a slow control-plane table or a learned MAC table.

My spoken answer would be: "For a packet hot path I would start with a hash table keyed by the flow tuple: source and destination addresses, ports, protocol, and maybe VLAN or namespace depending on the product. I would use a strong enough hash to avoid obvious clustering, then store entries in a cache-conscious layout. The lookup path should touch as few cache lines as possible."

Collision handling is the core tradeoff. Chaining is simple but pointer chasing is bad for cache. Open addressing or bucketized hashing can be faster because each bucket stores a small fixed number of compact entries inline. If a bucket overflows, I can use a small overflow area or fall back to a slower path, but I would want metrics because overflow behavior often becomes the tail-latency problem.

I would separate hot fields from cold fields. The hot path needs key fingerprints, state, next action, and maybe counters. Full metadata, debug strings, timestamps and policy details should not sit in the first cache line if every packet lookup pays for them. For deletion and timeout, I would avoid scanning the entire table on the hot path; use epochs, timers, or batched maintenance.
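
A sketch of the bucketized layout with inline fingerprints, assuming an IPv4 five-tuple key and no concurrency. FNV-1a is used only as a simple stand-in hash; a real datapath would more likely use a keyed hash or the NIC-provided RSS hash. All `ft_*` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define FT_BUCKETS 256u /* power of two */
#define FT_WAYS 4u      /* entries stored inline per bucket */

struct flow_key {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t proto;
    uint8_t pad[3]; /* explicit zeroed padding so memcmp is well defined */
};
static_assert(sizeof(struct flow_key) == 16, "key must have no hidden padding");

struct flow_bucket {
    uint16_t fp[FT_WAYS];         /* hot: fingerprints, 0 means empty */
    uint32_t action[FT_WAYS];     /* hot: next action per entry */
    struct flow_key key[FT_WAYS]; /* compared only on a fingerprint match */
};

struct flow_table {
    struct flow_bucket bucket[FT_BUCKETS];
    uint32_t overflow; /* inserts rejected because the bucket was full */
};

void ft_init(struct flow_table *t) { memset(t, 0, sizeof(*t)); }

struct flow_key ft_make_key(uint32_t sip, uint32_t dip, uint16_t sp,
                            uint16_t dp, uint8_t proto)
{
    struct flow_key k;
    memset(&k, 0, sizeof(k));
    k.src_ip = sip; k.dst_ip = dip;
    k.src_port = sp; k.dst_port = dp;
    k.proto = proto;
    return k;
}

static uint32_t ft_hash(const struct flow_key *k) /* FNV-1a, 32-bit */
{
    const uint8_t *p = (const uint8_t *)k;
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < sizeof(*k); i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/* Upper hash bits, forced non-zero so 0 can mean "empty". */
static uint16_t ft_fingerprint(uint32_t h) { return (uint16_t)((h >> 16) | 1u); }

bool ft_insert(struct flow_table *t, const struct flow_key *k, uint32_t action)
{
    uint32_t h = ft_hash(k);
    struct flow_bucket *b = &t->bucket[h & (FT_BUCKETS - 1u)];
    uint16_t fp = ft_fingerprint(h);
    int free_way = -1;
    for (unsigned w = 0; w < FT_WAYS; w++) {
        if (b->fp[w] == fp && memcmp(&b->key[w], k, sizeof(*k)) == 0) {
            b->action[w] = action; /* existing flow: update in place */
            return true;
        }
        if (b->fp[w] == 0 && free_way < 0)
            free_way = (int)w;
    }
    if (free_way < 0) {
        t->overflow++; /* caller falls back to a slower path */
        return false;
    }
    b->fp[free_way] = fp;
    b->key[free_way] = *k;
    b->action[free_way] = action;
    return true;
}

bool ft_lookup(const struct flow_table *t, const struct flow_key *k,
               uint32_t *action)
{
    uint32_t h = ft_hash(k);
    const struct flow_bucket *b = &t->bucket[h & (FT_BUCKETS - 1u)];
    uint16_t fp = ft_fingerprint(h);
    for (unsigned w = 0; w < FT_WAYS; w++) {
        if (b->fp[w] == fp && memcmp(&b->key[w], k, sizeof(*k)) == 0) {
            *action = b->action[w];
            return true;
        }
    }
    return false;
}
```

The fingerprints and actions sit at the front of each bucket, so most non-matching entries are rejected without reading a full key, and the `overflow` counter is the metric that shows when bucket pressure starts to matter.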

Concurrency depends on the datapath. If flows are sharded by RSS queue, per-queue tables reduce locking and improve locality. If multiple cores share a table, I would consider RCU-style reads, per-bucket locks for updates, or versioned entries. The right choice depends on whether lookups dominate updates. My instinct for low latency is to partition first, then share only when the product needs it.

What they're listening for: They want cache layout, collision strategy, lifetime, and concurrency. The trap is saying 'use a hashmap' without describing the packet-path consequences.
Likely follow-ups
  • How would you handle hash collisions?
  • What fields belong in the first cache line?
  • How would RSS change the design?
  • How do you expire old flows without hurting the hot path?

We want the lowest latency for small messages. What would you reach for and what tradeoffs would you name?

How I would approach it: I would ask what 'lowest latency' means: median, p99, one-way, round-trip, under load, with loss, over one host pair, or across a congested cluster. Then I would choose mechanisms that remove software overhead and queueing, while being honest about CPU cost and operational complexity.

My spoken answer would be: "For very small messages, I would first reduce crossings and queueing. I would look at kernel bypass or a user-level networking API where appropriate, polling instead of interrupt-driven receive for the latency-critical path, CPU pinning, queue affinity, NUMA-local memory, preallocated buffers, and avoiding allocation or locks in the send/receive path. I would tune interrupt moderation carefully or avoid interrupts for the hot queue."

I would also think about protocol choice. UDP or a reliable user-level transport can beat a general TCP path for some patterns, but then the application or library owns loss, ordering, congestion behavior, and recovery. TCP gives a mature stack and fairness but may add overhead and less predictable tail latency depending on settings and workload. RDMA-like approaches can reduce CPU involvement, but they bring memory registration, queue-pair scaling and operational constraints.

The tradeoffs I would name are:

  • Polling improves latency but burns cores.
  • Batching improves throughput but can hurt single-message latency.
  • Kernel bypass reduces overhead but moves safety, isolation and tooling responsibilities.
  • Larger rings absorb bursts but can hide queueing and increase tail latency.
  • Offloads help if they match the workload, but can add complexity or bad tail cases.

My embedded background helps because I am used to trading CPU budget, determinism and buffering. I would not claim one universal answer; I would ask for the workload and measure p50, p99 and p999 before declaring victory.

What they're listening for: They want tradeoff language, not a single buzzword. The trap is answering only 'kernel bypass' without discussing polling, affinity, batching, tail latency and correctness.
Likely follow-ups
  • When would polling be unacceptable?
  • How can batching hurt latency?
  • What metrics would you measure?
  • What would you tune first on a dual-socket server?

A latency histogram shows rare 200 microsecond spikes. How do you investigate?

How I would approach it: I would treat tail latency as a correlation problem. Average throughput can look perfect while a rare scheduler event, interrupt storm, cache miss pattern, reset path, or queue buildup creates the spike.

My spoken answer would be: "First I would verify the measurement. Where are timestamps taken, are clocks synchronized, is the tool itself allocating or logging in the hot path, and do spikes correlate with packet size or burst arrival? Then I would line up system events with spike times: CPU migrations, interrupts on the wrong core, softirq backlog, NAPI budget exhaustion, page faults, frequency scaling, NUMA remote memory, lock contention, firmware events, PCIe errors, or link-level counters."

I would try to reduce variables. Pin the process and queues, isolate CPUs if appropriate, disable power-saving states for a controlled test, keep memory local, use preallocated buffers, and run a known traffic pattern. If the spike disappears, reintroduce variables one by one. If it remains, instrument the datapath boundaries: application timestamp, enqueue, doorbell, completion, RX poll, handoff to application.

I would be careful with instrumentation. A printf in the hot path can create or hide the spike. I prefer counters, sampled tracing, per-CPU buffers, or hardware timestamps if available.
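
As an example of the cheap, counter-style instrumentation I mean, here is a sketch of a power-of-two latency histogram. It assumes one instance per CPU or per queue, so no atomics are needed, and the `lat_hist_*` names are illustrative. The portable shift loop would probably be replaced by a count-leading-zeros instruction in a real build.

```c
#include <assert.h>
#include <stdint.h>

#define LAT_BUCKETS 32u

struct lat_hist {
    /* bucket[i] counts samples in [2^i, 2^(i+1)) ns; bucket 0 also takes 0,
     * and the last bucket absorbs everything larger. */
    uint64_t bucket[LAT_BUCKETS];
    uint64_t count;
    uint64_t max_ns;
};

static unsigned log2_floor_u64(uint64_t v)
{
    unsigned b = 0;
    while ((v >>= 1) != 0)
        b++;
    return b;
}

/* Hot-path cost: a few shifts and three increments, no allocation or I/O. */
void lat_hist_record(struct lat_hist *h, uint64_t ns)
{
    unsigned b = log2_floor_u64(ns);
    if (b >= LAT_BUCKETS)
        b = LAT_BUCKETS - 1u;
    h->bucket[b]++;
    h->count++;
    if (ns > h->max_ns)
        h->max_ns = ns;
}

/* Upper edge (in ns) of the bucket holding quantile num/den.
 * Run this off the hot path, when the histogram is dumped. */
uint64_t lat_hist_quantile_upper_ns(const struct lat_hist *h,
                                    uint64_t num, uint64_t den)
{
    assert(den != 0 && num <= den);
    if (h->count == 0)
        return 0;
    uint64_t need = (h->count * num + den - 1) / den; /* nearest rank */
    if (need == 0)
        need = 1;
    uint64_t seen = 0;
    for (unsigned i = 0; i < LAT_BUCKETS; i++) {
        seen += h->bucket[i];
        if (seen >= need)
            return (uint64_t)1 << (i + 1);
    }
    return (uint64_t)1 << LAT_BUCKETS;
}
```

With 999 samples of 5 microseconds and one of 200 microseconds, the p50 and p99.9 buckets both end at 8192 ns while `max_ns` is 200000. That is the shape of this problem: the spike is invisible in the percentiles people usually quote and only shows up in the maximum or in a deeper tail.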

The bridge to my background is the same as PHY/firmware timing debug: rare timing bugs usually require narrowing the observation point until the latency appears between two specific boundaries. Once the spike is localized, the fix becomes much less speculative.

What they're listening for: Tail latency questions test systems discipline. The trap is tuning random knobs without proving where the time is spent.
Likely follow-ups
  • How do you avoid measurement perturbation?
  • What OS settings can affect low-latency networking?
  • How would hardware timestamps help?
  • What if throughput improves but p99 gets worse?

Design the receive path for a NIC at a high level. What are the main moving parts?

How I would approach it: I would draw the ownership path from empty buffer to application-visible packet, then add concurrency and recovery. I would not start with API details; I would start with who owns memory at each step.

My spoken answer would be: "Software allocates or recycles receive buffers and maps them for DMA. It posts descriptors to an RX ring and then publishes the producer index or doorbell so the NIC can use them. The NIC receives packets from the wire, chooses a queue, DMA-writes packet data into posted buffers, writes completion or updates descriptor status, and raises an interrupt or relies on polling. The driver polls completions, unmaps or syncs memory as needed, builds the OS packet representation or hands the buffer to a bypass path, replenishes the ring with fresh buffers, and updates stats."

The design risks are ownership and backpressure. If software does not replenish buffers fast enough, the NIC drops. If software reads the packet before DMA completion is visible, it sees stale data. If it reuses a buffer too early, the packet is corrupted. If all traffic lands on one queue, one CPU becomes the bottleneck even when total CPU looks available.

I would also call out scaling: RSS spreads flows across queues; each queue should ideally map to a CPU with local memory; interrupt moderation and polling control the latency/CPU tradeoff; counters must distinguish no-buffer drops from checksum errors, DMA errors, and stack drops.

I have not owned a Linux net_device RX path in production, so I would verify the exact kernel APIs. But the receive path as a hardware/software contract is very familiar: pre-post memory, device writes into it, software consumes completion, and both sides need precise ordering.

What they're listening for: They want an end-to-end model, not just 'packet enters driver'. The trap is skipping DMA ownership or ring replenishment.
Likely follow-ups
  • Where can packets be dropped in this path?
  • What does RSS buy you?
  • Why does ring replenishment matter?
  • What statistics would you expose?

A device/firmware register sometimes contains stale-looking state. How would you debug the hardware/software interface?

How I would approach it: I would frame it as an interface contract bug until proven otherwise. The question is whether software read the wrong thing, read at the wrong time, failed to publish something, or misunderstood what the hardware guarantees.

My spoken answer would be: "I would first write down the expected state machine. Who writes the register, who clears it, is it level or edge, are bits write-one-to-clear, are reads destructive, and is there any shadowed or posted-write behavior? Then I would add timestamps or sequence numbers around the boundary: before software writes, after the write is posted or flushed if needed, when firmware observes it, when firmware updates status, and when software reads it back."

I would check ordering carefully. MMIO writes may need a readback or barrier depending on the platform and accessor semantics. DMA-visible state has a different contract from MMIO-visible state. If there is an interrupt involved, I would check whether the interrupt can arrive before all status is visible, or whether software clears the cause before draining the condition.
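
One concrete version of the "clears the cause before draining the condition" hazard, as a pure software model of a write-one-to-clear status register. There is no real MMIO here and all names are hypothetical. The model shows why the acknowledge should write back only the bits that were actually handled.

```c
#include <assert.h>
#include <stdint.h>

/* Software model of a write-one-to-clear (W1C) status register. */
struct w1c_reg {
    uint32_t bits;
};

uint32_t w1c_read(const struct w1c_reg *r) { return r->bits; }

/* Software write: each 1 bit clears that cause, 0 bits leave it alone. */
void w1c_write(struct w1c_reg *r, uint32_t v) { r->bits &= ~v; }

/* Device side of the model: hardware raises one or more causes. */
void w1c_hw_set(struct w1c_reg *r, uint32_t v) { r->bits |= v; }

/* Acknowledge only what the handler actually processed.
 * Returns the causes still pending afterwards. */
uint32_t irq_ack_handled(struct w1c_reg *r, uint32_t handled)
{
    assert(handled != 0); /* acknowledging nothing is probably a caller bug */
    w1c_write(r, handled);
    return w1c_read(r);
}

/* Buggy pattern: re-read at acknowledge time and write everything back.
 * Any cause raised after the handler took its snapshot is cleared unseen. */
uint32_t irq_ack_reread(struct w1c_reg *r)
{
    w1c_write(r, w1c_read(r));
    return w1c_read(r);
}
```

If the device raises bit 1 after the handler has snapshotted bit 0, `irq_ack_handled` leaves bit 1 pending, while `irq_ack_reread` returns 0 and the second event is lost. From the outside that loss looks exactly like stale state.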

I would also reduce the repro. Single queue, one command, one completion, no batching, then add stress. If the bug only appears under load, I would look for wraparound, missed ownership transitions, counter races, and ring full/empty ambiguity.

This is one of the strongest bridges from my MediaTek work. The MCU-DSP issue was exactly this kind of boundary problem: the software symptom only made sense after we clarified timing, ownership and which side had authoritative state. I would bring that discipline here while learning the NIC-specific registers and Linux accessors.

What they're listening for: They are probing hardware/software seam instincts. The trap is treating a stale register as just a polling loop problem instead of an ordering or contract problem.
Likely follow-ups
  • What is a write-one-to-clear register?
  • Why might an MMIO readback matter?
  • How do posted writes complicate debugging?
  • How would you make the repro smaller?

How would you design a small command/completion queue between driver and firmware?

How I would approach it: I would define command format, ownership bits, producer/consumer indices, completion matching, timeout behavior, and reset recovery. I would keep the queue fixed-size and observable.

My spoken answer would be: "I would use a ring of command descriptors and a ring or area for completions. Each command has an opcode, length, flags, sequence id, status field, and either inline parameters or a pointer to a DMA buffer for larger payloads. Software fills the command completely, makes it visible with the right ordering, then advances the producer index or rings a doorbell. Firmware consumes commands, validates them, performs the work, writes a completion with the sequence id and status, then advances its completion producer."

The sequence id matters because completions can be delayed, retried, or observed during recovery. The timeout policy matters because firmware bugs or reset events cannot leave the driver waiting forever. I would include counters for submitted, completed, timed out, invalid opcode, queue full, and reset-dropped commands.

For concurrency, I would decide whether only one thread submits commands or whether a lock protects submission. If completions arrive in interrupt context, completion handling must be bounded and defer slow work. During reset, the design needs a generation number or state flag so old completions are not mistaken for current commands.
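
A sketch of the driver-side bookkeeping for sequence ids, timeouts and the reset generation. It assumes a single submitter and that firmware echoes back the generation it was given; the transport itself (rings, doorbell, DMA) is left out, and the `cmdq_*` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define CMDQ_DEPTH 4u

struct cmd_slot {
    bool pending;
    uint16_t seq;
    uint16_t opcode;
    uint64_t deadline; /* same time base as "now", e.g. a tick count */
};

struct cmdq_stats {
    uint32_t submitted, completed, timed_out, queue_full, stale, reset_dropped;
};

struct cmdq {
    struct cmd_slot slot[CMDQ_DEPTH];
    uint16_t next_seq;
    uint16_t generation; /* bumped on every reset */
    struct cmdq_stats stats;
};

struct cmd_completion {
    uint16_t seq;
    uint16_t generation;
    int32_t status;
};

void cmdq_init(struct cmdq *q) { memset(q, 0, sizeof(*q)); }

bool cmdq_submit(struct cmdq *q, uint16_t opcode, uint64_t now,
                 uint64_t timeout, uint16_t *seq_out)
{
    for (unsigned i = 0; i < CMDQ_DEPTH; i++) {
        struct cmd_slot *s = &q->slot[i];
        if (!s->pending) {
            s->pending = true;
            s->seq = q->next_seq++;
            s->opcode = opcode;
            s->deadline = now + timeout;
            *seq_out = s->seq;
            q->stats.submitted++;
            return true;
        }
    }
    q->stats.queue_full++;
    return false;
}

/* A completion is accepted only if its generation and sequence id match
 * a command that is still pending; anything else is counted as stale. */
bool cmdq_complete(struct cmdq *q, const struct cmd_completion *c)
{
    if (c->generation == q->generation) {
        for (unsigned i = 0; i < CMDQ_DEPTH; i++) {
            struct cmd_slot *s = &q->slot[i];
            if (s->pending && s->seq == c->seq) {
                s->pending = false;
                q->stats.completed++;
                return true;
            }
        }
    }
    q->stats.stale++;
    return false;
}

/* Bounded scan: the driver never waits forever on firmware. */
unsigned cmdq_expire(struct cmdq *q, uint64_t now)
{
    unsigned n = 0;
    for (unsigned i = 0; i < CMDQ_DEPTH; i++) {
        struct cmd_slot *s = &q->slot[i];
        if (s->pending && now >= s->deadline) {
            s->pending = false;
            q->stats.timed_out++;
            n++;
        }
    }
    return n;
}

void cmdq_reset(struct cmdq *q)
{
    q->generation++;
    for (unsigned i = 0; i < CMDQ_DEPTH; i++) {
        if (q->slot[i].pending) {
            q->slot[i].pending = false;
            q->stats.reset_dropped++;
        }
    }
}
```

Late, duplicate and pre-reset completions all land in the `stale` counter instead of being matched to a live command, which is the failure the sequence id and generation exist to prevent.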

This is close to my real experience: embedded firmware work often has command/state handoffs between processors or blocks. I would be honest that the Linux driver APIs are the part I am ramping on, but the queue contract itself is not foreign.

What they're listening for: A good mini-design covers format, matching, timeouts, reset, and observability. The trap is only drawing a ring and forgetting failure modes.
Likely follow-ups
  • How do you handle a timed-out command?
  • What happens during device reset?
  • Why include sequence ids?
  • What statistics would make this supportable?

You are given a slow packet path. How do you decide whether to optimize data structures, algorithms, or hardware settings first?

How I would approach it: I would profile before optimizing, but I would profile at the right granularity. In packet paths, the cost can be CPU instructions, cache misses, queueing, interrupts, DMA behavior, PCIe transactions, locks, or application backpressure.

My spoken answer would be: "I would first define the metric: throughput, median latency, p99, CPU per Gbit, or packet drops. Then I would locate the bottleneck. If CPU cycles are high, I would look at instruction count, branches, cache misses, allocations and locks. If CPU is low but drops happen, I would look at ring sizing, interrupt moderation, NAPI budget, hardware counters, or application receive rate. If latency is bad only under burst, I would look for queueing rather than raw computation."

Data structure changes are attractive when lookup or per-packet state is hot: flow tables, buffer pools, descriptor metadata, and free lists. Algorithm changes matter when the same work repeats per packet and can be batched, offloaded, or moved out of the hot path. Hardware settings matter when queue steering, interrupt moderation, offloads, MTU, or PCIe/NUMA placement are wrong.

I would try small controlled experiments. Pin queues, change moderation, disable one offload, vary packet size, test one queue versus many, then inspect counters. A change that improves average throughput but worsens p99 may be unacceptable for this team.

My answer would stay humble: I know this role's Ethernet/Linux details are an adjacent domain for me, so I would lean on existing driver tooling and team practice. But I would not optimize blindly. I would use the same discipline I used in DSP firmware: convert the symptom into a measurement, isolate the boundary, change one thing, and verify.

What they're listening for: They want prioritization and measurement, not premature cleverness. The trap is jumping to micro-optimizing C before proving the bottleneck.
Likely follow-ups
  • What if `perf` says the CPU is mostly idle?
  • How do you distinguish queueing latency from compute latency?
  • When is batching the wrong optimization?
  • What experiment would you run first?

How would you explain your wireless-PHY background as relevant in a live technical scenario without overclaiming?

How I would approach it: I would be explicit that wireless PHY and Ethernet NIC drivers are different domains, then connect them through engineering invariants rather than pretending the protocols are the same.

My spoken answer would be: "I would not say that 3GPP PHY work makes me already a Linux NIC driver engineer. The honest claim is narrower and stronger: I have built and debugged embedded C software where correctness depends on timing, fixed memory, hardware-visible state, queues, interrupts or events, and precise ownership between processing blocks. Those same categories appear in NIC work even though the protocol stack changes."

Then I would anchor it. "For UL-DAI, I had to translate a specification into C behavior under firmware constraints. For TX DSP firmware and the MCU-DSP bug, I had to reason across a boundary where the symptom could be software, hardware state, timing, or an interface assumption. For the NTN PoC and test automation, I learned new communications constraints quickly and turned them into repeatable validation."

I would then return to the asked scenario. If the prompt is a ring buffer, I would say the transferable part is producer/consumer ownership and bounded memory. If it is dropped packets, the transferable part is structured triage across boundaries. If it is 100GbE timing, the transferable part is converting rate into time budget and ruling out expensive operations.

That keeps the bridge honest: same engineer, adjacent domain. I am not selling protocol depth I do not have; I am selling the hardware/software reasoning that lets me ramp into the protocol depth quickly.

What they're listening for: This is a meta-scenario they may implicitly test throughout. The trap is either underselling yourself or claiming wireless PHY is basically Ethernet.
Likely follow-ups
  • Which part of the role is most new for you?
  • Which part transfers most directly?
  • How would you close the Ethernet/TCP-IP gap?
  • What project from your CV best proves the transfer?

A TX ring occasionally stops making progress. How would you debug whether it is a software ring bug, a DMA visibility issue, or a device/firmware issue?

How I would approach it: I would start by making the ownership state visible. A stuck TX ring is usually not solved by guessing; I need to know which side thinks it owns each descriptor.

My spoken answer would be: "I would capture the ring state: software producer index, software consumer or clean index, hardware consumer if exposed, descriptor status, pending completions, queue enable state, interrupt or event counters, and whether the doorbell was written after the descriptors were prepared. Then I would check whether the packet buffers and descriptors are still valid and DMA-mapped."

I would split the hypotheses.

  • If software thinks the ring is full but hardware has consumed descriptors, the completion or clean path may be broken.
  • If software posts descriptors but hardware never sees them, I would suspect doorbell, queue state, DMA mapping, ordering, or a device reset condition.
  • If hardware sees descriptors but reports errors, I would inspect descriptor format, DMA address, length, flags and offload metadata.
  • If completions exist but no interrupt fires, I would check interrupt moderation, MSI-X vector setup, NAPI scheduling and event masking.
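
The hypothesis split above can be written down as a small triage helper over a captured ring snapshot. This is a heuristic sketch: the snapshot fields are assumptions about what a device might expose, the names are hypothetical, and the result is the first hypothesis to test rather than a verdict.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum tx_suspect {
    TX_IDLE,               /* nothing outstanding, not actually stuck */
    TX_INCONSISTENT,       /* indices contradict each other: bad capture or corruption */
    TX_SUSPECT_DESCRIPTOR, /* device reports errors: format, address, length, flags */
    TX_SUSPECT_DOORBELL,   /* descriptors posted but the doorbell never caught up */
    TX_SUSPECT_DEVICE,     /* doorbell written, device consumed nothing */
    TX_SUSPECT_IRQ,        /* device consumed, but no interrupt reached software */
    TX_SUSPECT_CLEAN_PATH  /* interrupts arrive, but descriptors are not reclaimed */
};

struct tx_snapshot {
    uint32_t sw_prod;         /* descriptors software has posted */
    uint32_t sw_clean;        /* descriptors software has reclaimed */
    uint32_t hw_cons;         /* descriptors the device reports consumed */
    uint32_t doorbell;        /* last value written to the doorbell */
    uint32_t hw_errors;       /* device-reported TX errors */
    uint32_t irqs_since_last; /* interrupts seen since the previous snapshot */
};

enum tx_suspect tx_triage(const struct tx_snapshot *s)
{
    assert(s != NULL);
    /* Unsigned subtraction keeps these correct across index wraparound. */
    uint32_t outstanding = s->sw_prod - s->sw_clean;
    uint32_t consumed_not_cleaned = s->hw_cons - s->sw_clean;

    if (outstanding == 0)
        return TX_IDLE;
    if (consumed_not_cleaned > outstanding)
        return TX_INCONSISTENT;
    if (s->hw_errors != 0)
        return TX_SUSPECT_DESCRIPTOR;
    if (s->doorbell != s->sw_prod)
        return TX_SUSPECT_DOORBELL;
    if (consumed_not_cleaned == 0)
        return TX_SUSPECT_DEVICE;
    if (s->irqs_since_last == 0)
        return TX_SUSPECT_IRQ;
    return TX_SUSPECT_CLEAN_PATH;
}
```

The snapshot should be taken once the ring has visibly stopped making progress; on a healthy but busy ring the same index pattern just means the device has not caught up yet.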

For ordering, I would be careful: descriptor writes must be visible before the doorbell. If a posted MMIO doorbell creates uncertainty, I would use the driver's established accessor pattern or a readback where appropriate rather than inventing a new rule.

I would connect this honestly to my background: I have not debugged a production Linux NIC TX ring, but I have debugged hardware/software progress bugs where two processors or blocks disagreed about state. My instinct is the same: draw the ownership timeline, capture counters at each boundary, and reduce batching until the first missing transition is obvious.

What they're listening for: This covers a high-value NIC scenario that blends ring buffers, descriptors, doorbells, interrupts, DMA ordering and honest-gap framing.
Likely follow-ups
  • What counters would you add?
  • How could a reset path create this symptom?
  • How would you prove the doorbell was observed?

A packet parser works in unit tests but fails on customer captures with VLANs or short frames. How would you harden it?

How I would approach it: I would assume the parser trusted an implicit layout that real traffic violated. Packet code must be hostile to its input, even if the packets usually come from friendly lab tools.

My spoken answer would be: "I would first reproduce with the smallest failing capture and print or trace the offset, remaining length, EtherType, VLAN depth, IP version and IHL at each parse step. Then I would audit every read: before reading N bytes, prove len - off >= N; before advancing an offset, prove it cannot overflow; before reading an IPv4 field, prove the header is present and ihl >= 5."

The likely fixes are:

  • Parse big-endian fields explicitly or through endian helpers.
  • Avoid casting uint8_t * to header structs, which risks unaligned access; read fields byte by byte or through memcpy.
  • Handle one VLAN tag if required, and decide deliberately whether QinQ or stacked tags are supported or rejected.
  • Reject runt frames and malformed IPv4 headers cleanly.
  • Add tests from real captures: short Ethernet, VLAN IPv4, VLAN non-IPv4, IPv4 options, truncated IPv4, and unknown EtherType.
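The fixes above can be sketched as one bounds-checked parse function. This is a minimal sketch under stated assumptions, not a production parser: parse_frame, parse_result, the status codes and MAX_VLAN_DEPTH are hypothetical names, and accepting at most two stacked tags is a policy I picked for illustration. It handles IPv4 only and copies no packet bytes.

```c
#include <stddef.h>
#include <stdint.h>

#define ETH_HDR_LEN     14u
#define VLAN_TAG_LEN     4u
#define IPV4_MIN_HDR    20u
#define ETHERTYPE_IPV4  0x0800u
#define ETHERTYPE_VLAN  0x8100u  /* 802.1Q */
#define ETHERTYPE_QINQ  0x88A8u  /* 802.1ad */
#define MAX_VLAN_DEPTH  2u       /* assumption: deeper stacks are rejected */

enum parse_status {
    PARSE_OK, PARSE_TRUNCATED, PARSE_TOO_MANY_VLANS,
    PARSE_NOT_IPV4, PARSE_BAD_IPV4
};

struct parse_result {
    unsigned vlan_depth;  /* number of VLAN tags skipped */
    size_t l3_off;        /* offset of the IPv4 header */
    size_t ip_hdr_len;    /* ihl * 4 */
    uint8_t ip_proto;     /* IPv4 protocol field */
};

/* Explicit big-endian read: no struct cast, no alignment assumption. */
static uint16_t rd_be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

static enum parse_status parse_frame(const uint8_t *buf, size_t len,
                                     struct parse_result *out)
{
    size_t off = 12;  /* skip destination and source MAC */
    uint16_t etype, total_len;
    unsigned ihl;

    out->vlan_depth = 0;
    if (len < ETH_HDR_LEN)
        return PARSE_TRUNCATED;       /* runt frame */
    etype = rd_be16(buf + off);
    off += 2;

    while (etype == ETHERTYPE_VLAN || etype == ETHERTYPE_QINQ) {
        if (out->vlan_depth == MAX_VLAN_DEPTH)
            return PARSE_TOO_MANY_VLANS;
        if (len - off < VLAN_TAG_LEN) /* off <= len always holds here */
            return PARSE_TRUNCATED;
        etype = rd_be16(buf + off + 2); /* TCI first, then inner EtherType */
        off += VLAN_TAG_LEN;
        out->vlan_depth++;
    }

    if (etype != ETHERTYPE_IPV4)
        return PARSE_NOT_IPV4;
    if (len - off < IPV4_MIN_HDR)
        return PARSE_TRUNCATED;
    ihl = buf[off] & 0x0Fu;
    if ((buf[off] >> 4) != 4u || ihl < 5u)
        return PARSE_BAD_IPV4;
    if (len - off < (size_t)ihl * 4u)
        return PARSE_TRUNCATED;       /* options run past the buffer */
    total_len = rd_be16(buf + off + 2);
    if (total_len < ihl * 4u)
        return PARSE_BAD_IPV4;
    if (total_len > len - off)
        return PARSE_TRUNCATED;       /* header claims more than captured */

    out->l3_off = off;
    out->ip_hdr_len = (size_t)ihl * 4u;
    out->ip_proto = buf[off + 9];
    return PARSE_OK;
}
```

Every read is preceded by a check of the form len - off >= N, and off only advances after that check passes, so the subtraction cannot underflow. Distinct status codes also make the capture-based tests easy to write: short Ethernet, VLAN IPv4, VLAN non-IPv4, IPv4 options and truncated IPv4 each map to one expected result.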

I would also check whether the parser is used on a hot path. If it is, I would keep the hardening branch-light and avoid copying packet bytes. Correctness comes first, but in a NIC datapath I should be aware that every extra read, branch or cache miss can matter.

This is a good place to bridge from my CV without overclaiming. I have worked from protocol specifications into C and tests in 3GPP PHY; I have not shipped Ethernet parser code in a NIC driver. The transferable habit is treating the protocol as a byte-level contract and building tests around malformed edge cases.

What they're listening for: Whether I can move from writing a parse function to debugging a real parser failure, and whether I name the VLAN, length, endian and alignment traps without being prompted.
Likely follow-ups
  • What customer capture would you ask for?
  • How do you avoid parser overrun?
  • What would you test after the fix?
Ping works, but throughput is terrible. How do you debug it?

How I would approach it: I would start by saying that ping proves only a small ICMP packet can make a round trip. It does not prove TCP throughput, MTU, offloads, queue steering, interrupt behavior, or loss under load are healthy.

My spoken answer would be: "First I would define terrible: is it low TCP goodput, high retransmits, timeouts, high CPU, one direction only, one host pair, or only one packet size? Then I would run a controlled test with known packet sizes and collect counters before and after, so I know whether I am chasing the network path, the host datapath, or the application."

I would split it this way:

  • MTU and MSS: try ping -M do -s <size> at several sizes or equivalent path-MTU checks, inspect interface MTU, VLAN or tunnel overhead, and TCP MSS. A black-holed PMTUD path can let small pings work while full-size TCP segments are silently dropped and the connection stalls.
  • TCP behavior: check retransmits, RTT, receive window, window scaling, congestion window, and whether TCP_NODELAY or Nagle matters for small request/response traffic. Nagle is not usually the cause of bulk throughput collapse, but it can make small-message latency look terrible.
  • Offloads: compare TSO/GSO/GRO/LRO and checksum offload settings. Disabling offloads can make CPU become the bottleneck; a broken or mismatched offload can corrupt the measurement.
  • Interrupt versus poll: inspect coalescing, NAPI or polling behavior, and CPU load. Too high an interrupt rate hurts throughput; too much coalescing can improve throughput while hurting tail latency.
  • Queues and CPU placement: look for a single RX or TX queue doing all the work, bad RSS hash coverage, IRQs on the wrong CPU, application pinned away from the queue, or remote NUMA memory.
  • Loss and retransmit: inspect NIC, driver, switch and TCP counters for drops, CRC/FEC errors, pause frames, retransmits and out-of-order packets. Even tiny loss rates can destroy TCP throughput.
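For the MTU and MSS item in the list above, the arithmetic is worth having ready. A minimal sketch, assuming IPv4 with no IP options and TCP with no options; the function names are hypothetical, and the 50-byte figure in the usage note is the usual VXLAN-over-IPv4 overhead, used here only as an example of tunnel overhead.

```c
#define IPV4_HDR_LEN 20u  /* IPv4 header without options */
#define TCP_HDR_LEN  20u  /* TCP header without options */
#define ICMP_HDR_LEN  8u  /* ICMP echo header */

/* Subtract overhead from an MTU, clamping at zero instead of wrapping. */
static unsigned mtu_minus(unsigned mtu, unsigned overhead)
{
    return mtu > overhead ? mtu - overhead : 0u;
}

/* Largest ICMP payload whose IPv4 packet exactly fills the MTU. */
static unsigned ping_payload_for_mtu(unsigned mtu)
{
    return mtu_minus(mtu, IPV4_HDR_LEN + ICMP_HDR_LEN);
}

/* TCP MSS that exactly fills the MTU (IPv4, no options). */
static unsigned tcp_mss_for_mtu(unsigned mtu)
{
    return mtu_minus(mtu, IPV4_HDR_LEN + TCP_HDR_LEN);
}
```

With a 1500-byte MTU this gives a 1472-byte ping payload and a 1460-byte MSS, so ping -M do -s 1472 should pass and -s 1473 should fail on a clean path. If a tunnel takes 50 bytes, the inner MTU is mtu_minus(1500, 50) = 1450 and the MSS drops to 1410; an endpoint that still advertises 1460 is a candidate for the stall.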

The practical command shape I would use is ip -s link, ethtool -S, ethtool -k, ethtool -c, ethtool -x, /proc/interrupts, ss -ti, and a clean iperf3 or application-level repro with counter deltas. I would change one variable at a time: MTU, offloads, queue affinity, coalescing, CPU pinning, then packet size and stream count.
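"Counter deltas" means snapshotting before and after the run and reasoning only about the difference. A minimal sketch of that step follows; the struct fields are hypothetical labels, since the real names in ethtool -S output are driver-specific and the TCP figures come from a different source.

```c
#include <stdint.h>

/* Hypothetical subset of counters captured before and after a test run. */
struct run_counters {
    uint64_t rx_packets;
    uint64_t rx_dropped;
    uint64_t tcp_segs_out;
    uint64_t tcp_retrans;
};

/* Per-field difference: what happened during the run, not since boot. */
static struct run_counters counters_delta(const struct run_counters *before,
                                          const struct run_counters *after)
{
    struct run_counters d;

    d.rx_packets = after->rx_packets - before->rx_packets;
    d.rx_dropped = after->rx_dropped - before->rx_dropped;
    d.tcp_segs_out = after->tcp_segs_out - before->tcp_segs_out;
    d.tcp_retrans = after->tcp_retrans - before->tcp_retrans;
    return d;
}

/*
 * Events per million, or 0 when nothing was counted. The multiplication
 * can overflow for extremely large counts; acceptable for a short run.
 */
static uint64_t per_million(uint64_t part, uint64_t total)
{
    return total ? (part * 1000000u) / total : 0u;
}
```

A nonzero drop or retransmit rate in the delta sends me toward the loss path; clean deltas with poor throughput send me toward CPU, queue placement or offload settings instead.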

The honest bridge is that I have studied and traced this as a NIC/Linux debugging workflow rather than shipped a production Linux NIC driver. What I can bring from MediaTek is the triage discipline: with UL-DAI and TX DSP firmware issues, and with 100+ customer/internal cases, I learned not to tune blindly. I locate the first boundary where the evidence changes, prove the mechanism, then fix the right layer.

What they're listening for: This is a first-screen favourite because it checks practical network debugging structure. The trap is saying only 'MTU mismatch' or only 'use iperf' without reasoning across TCP, offloads, queues, CPU, interrupts and loss.
Likely follow-ups
  • Why can small pings work when TCP throughput collapses?
  • Which counters would show packet loss versus host CPU bottleneck?
  • How can disabling offloads make a benchmark worse?
  • What would one hot RX queue tell you?