🧪 The First Technical Screen
A mock first technical screen for the former Solarflare/AMD networking team: show calm low-level C, packet-path instincts, and a credible PHY/embedded-to-NIC pivot.
Let's start with a C warm-up. What is `sizeof` for this struct, and what are the traps: `struct s { char c; int x; short y; };`?
I'd avoid guessing an exact size without the target ABI, but on a typical 64-bit Linux ABI where int is 4-byte aligned and short is 2-byte aligned, the layout is usually:
- c at offset 0
- 3 bytes of padding
- x at offset 4
- y at offset 8
- then 2 bytes of tail padding so arrays of the struct keep x aligned
So sizeof(struct s) is likely 12, not 7 or 10.
The way I'd verify is with sizeof and offsetof, not by assuming:
#include <stddef.h>
#include <stdio.h>

struct s { char c; int x; short y; };

int main(void)
{
    /* Print the real layout for this compiler and ABI instead of assuming it. */
    printf("size=%zu c=%zu x=%zu y=%zu\n",
           sizeof(struct s), offsetof(struct s, c),
           offsetof(struct s, x), offsetof(struct s, y));
}
The networking relevance is that I would not cast arbitrary packet bytes to this struct and trust the compiler layout. For hardware descriptors or wire formats, I want explicit widths like uint32_t, documented alignment, endian conversion, and usually parse from bytes or use packed structures only where the ABI contract really requires it. Packed structs can create unaligned accesses and slower code, so I use them deliberately.
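As a minimal sketch of how I'd pin a descriptor layout down at compile time, assuming a hypothetical 16-byte RX descriptor (`struct rx_desc_sketch` and its field names are invented for illustration, not taken from any real NIC):

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof */
#include <stdint.h>

/* Hypothetical descriptor: fixed-width members, ordered so that natural
 * alignment leaves no padding holes. Byte order is a separate contract. */
struct rx_desc_sketch {
    uint64_t buf_addr;   /* DMA address of the packet buffer */
    uint32_t len_flags;  /* length and flag bits */
    uint16_t vlan_tci;
    uint16_t status;
};

/* Fail the build, not the hardware bring-up, if the layout ever drifts. */
static_assert(sizeof(struct rx_desc_sketch) == 16, "descriptor must be 16 bytes");
static_assert(offsetof(struct rx_desc_sketch, buf_addr) == 0, "buf_addr offset");
static_assert(offsetof(struct rx_desc_sketch, len_flags) == 8, "len_flags offset");
static_assert(offsetof(struct rx_desc_sketch, vlan_tci) == 12, "vlan_tci offset");
static_assert(offsetof(struct rx_desc_sketch, status) == 14, "status offset");
```

The point of the asserts is that the layout becomes a checked contract rather than something I infer from the compiler; endianness of each field still has to be handled at the point of access.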
- How would you make the layout stable for a hardware descriptor?
- Why can casting network bytes to a struct be unsafe?
- What is tail padding and why does it matter for arrays?
Reorder that struct to shrink it. How small can it get, and what's the general rule?
If I put the widest members first, the padding mostly disappears:
struct s_packed_well {
    int x;   /* offset 0, 4 bytes */
    short y; /* offset 4, 2 bytes */
    char c;  /* offset 6, 1 byte */
};           /* 1 byte tail pad -> sizeof 8 */
So it drops from 12 to 8 just by ordering members largest-alignment first. The general rule: a member's offset must be a multiple of its alignment, and the struct's total size is rounded up to a multiple of its strictest member alignment so that every element of an array stays aligned. Ordering members large-to-small minimises the internal holes.
I'd say out loud that this matters in driver work for two reasons. One, cache footprint: a smaller hot descriptor or per-packet struct means more of them per cache line, which is real at line rate. Two, I would not reach for __attribute__((packed)) just to save bytes, because that forces the compiler to assume unaligned access and can generate slower byte-by-byte loads on some targets. I pack only when an external contract, a wire format or a hardware register block, dictates the exact layout, and then I'm explicit about endianness too.
- When is `packed` actually the right call?
- How does struct size interact with cache lines on a hot path?
- What does `_Alignof` tell you and when would you use `_Alignas`?
What does `volatile` do in C, and what does it not do?
volatile tells the compiler that the object can change in ways it cannot see, so loads and stores to that object must actually happen as observable accesses. That is useful for memory-mapped device registers, status flags updated by hardware, or some very constrained embedded cases.
What it does not do is at least as important:
- It does not make an operation atomic.
- It does not make code thread-safe.
- It does not provide cache coherency.
- It does not create the full memory ordering you need between CPU cores or between CPU and device DMA.
- It does not replace locks, atomics, or kernel primitives.
In an embedded PHY setting I would use volatile for MMIO registers, but if I were coordinating producer/consumer state between CPU and hardware, I'd also think about barriers and ownership. In Linux driver code that means using the kernel's MMIO accessors like readl()/writel() rather than raw volatile pointer dereferences, and using the right DMA and memory barrier primitives around descriptor rings.
So my short answer is: volatile prevents certain compiler optimisations on the access itself; it is not a synchronization primitive.
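To make the "access itself" point concrete, here is a small bare-metal-style sketch. The helper names (`mmio_read32`, `mmio_write32`, `mmio_wait_set`) are hypothetical; in Linux driver code I would use readl()/writel() instead, and nothing below provides ordering against other memory or DMA.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers: each call performs one 32-bit access that the
 * compiler may not elide, merge with another, or cache in a register. */
static inline uint32_t mmio_read32(const volatile uint32_t *reg)
{
    return *reg;
}

static inline void mmio_write32(volatile uint32_t *reg, uint32_t v)
{
    *reg = v;
}

/* Poll a status register until every bit in `mask` is set, giving up after
 * `max_spins` reads. Without volatile the compiler could hoist the load
 * out of the loop and spin on a stale value. */
static bool mmio_wait_set(const volatile uint32_t *reg, uint32_t mask,
                          unsigned max_spins)
{
    assert(reg != NULL);
    while (max_spins--) {
        if ((mmio_read32(reg) & mask) == mask)
            return true;
    }
    return false;
}
```

The bounded spin is a design choice I'd keep in real code too: a poll loop on hardware state should always have a timeout path.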
- When would you use `readl()` or `writel()` instead of a volatile pointer?
- What is the difference between compiler reordering and CPU memory ordering?
- How would you publish a descriptor to a device safely?
Write a small function to test whether bit `n` is set in a 32-bit value, and then set or clear it.
I'd write it with unsigned types and guard the shift range, because shifting by 32 or more on a 32-bit value is undefined behaviour.
#include <stdbool.h>
#include <stdint.h>

bool bit_is_set(uint32_t v, unsigned n)
{
    return n < 32 && ((v & (UINT32_C(1) << n)) != 0);
}

uint32_t bit_assign(uint32_t v, unsigned n, bool set)
{
    if (n >= 32)
        return v;
    uint32_t mask = UINT32_C(1) << n;
    return set ? (v | mask) : (v & ~mask);
}
Thinking aloud, I would say: use uint32_t, use UINT32_C(1) so the shift is on the intended width, check n, and avoid signed shifts. If this were a shared register or flag word, I would then ask whether it needs atomic read-modify-write or whether the caller already holds the lock. In driver work, that distinction matters: a correct bit operation in single-threaded C can still be wrong if an interrupt handler or another CPU updates the same word.
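If the word is shared between threads, one option is an atomic read-modify-write. This is a sketch using C11 `<stdatomic.h>`; the function name `flag_bit_assign_atomic` is my own, and it applies to ordinary shared memory, not to device registers, which need the driver's own locking or hardware set/clear registers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Atomically set or clear bit n of a shared flag word.
 * Returns the previous state of that bit. */
static bool flag_bit_assign_atomic(_Atomic uint32_t *word, unsigned n, bool set)
{
    assert(n < 32);                     /* shifting by >= 32 is undefined */
    uint32_t mask = UINT32_C(1) << n;
    uint32_t old = set
        ? atomic_fetch_or_explicit(word, mask, memory_order_acq_rel)
        : atomic_fetch_and_explicit(word, ~mask, memory_order_acq_rel);
    return (old & mask) != 0;
}
```

Returning the old bit is deliberate: it lets the caller tell "I set it" from "it was already set", which is the usual basis for a test-and-set style flag.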
- What happens if `n == 32`?
- Would this be safe if two threads call it on the same variable?
- How would you count set bits efficiently?
Here's some code. What does it print? `uint8_t a = 0xFF; if (~a == 0xFF00) puts("hi");`
This is the integer-promotion trap, and I'd reason it out rather than guess. In C, before ~ is applied, a is promoted to `int` because all uint8_t values fit in int. So a becomes the int value 255 (0x000000FF with a 32-bit int), and ~a is 0xFFFFFF00, which as a signed int is -256.
The comparison ~a == 0xFF00 is then -256 == 65280, which is false, so it prints nothing. The naive reasoning, that ~0xFF on a byte gives 0x00 and 0x00 is not 0xFF00, happens to reach the same verdict here, but for the wrong reason, and the wrong reason bites you elsewhere: `if (~a == 0x00)` is also false, which surprises anyone expecting byte-width semantics.
The rule I keep in mind: any operand narrower than `int` is promoted to `int` (or to `unsigned int` if `int` cannot represent every value of the type) before arithmetic or bitwise operators. So ~, <<, * on uint8_t/uint16_t operate on a 32-bit signed value on common ABIs, not the narrow type. The fix when I want byte-width semantics is to cast or mask back explicitly:
uint8_t r = (uint8_t)(~a); /* now 0x00 as a byte */
This matters in packet and register code constantly, for example building a one's-complement checksum from uint16_t words, or computing ~mask on a small field. I always cast or mask back to the intended width so promotion doesn't smuggle in sign bits.
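A small sketch that makes the promotion visible. The function names are mine, and the static_assert records the assumption that int is wider than 16 bits.

```c
#include <assert.h>
#include <stdint.h>

static_assert(sizeof(int) > sizeof(uint16_t),
              "sketch assumes int is wider than 16 bits");

/* ~ is applied after promotion to int, so the result is a negative int. */
static int not_promoted(uint8_t a)
{
    return ~a;
}

/* Casting back truncates to the intended 8-bit width. */
static uint8_t not_byte(uint8_t a)
{
    return (uint8_t)~a;
}

/* Same trap with a shift: it happens in int, so the high bits survive
 * until something truncates them. */
static uint16_t shl4_u16(uint16_t x)
{
    return (uint16_t)(x << 4);
}
```

With a = 0xFF, `not_promoted` yields -256 while `not_byte` yields 0, which is exactly the gap between what the language does and what byte-width intuition expects.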
- Why does `uint16_t x = 0xFFFF; x = x << 4;` not do what people expect on the value before assignment?
- Where does this bite you in a checksum loop?
- What's the rule for when promotion goes to `int` vs `unsigned int`?
Spot the bug: `uint32_t reg = readl(base); reg & ~(0xF << 4); reg |= (val << 4); writel(reg, base);`
There's a missing assignment and a latent width concern. The second statement, reg & ~(0xF << 4);, computes the cleared value and throws it away; it is an expression statement with no effect, and a decent compiler warns about it. So the field is never actually cleared, and then reg |= (val << 4) ORs the new bits on top of the old field, giving a corrupted mix. The fix:
uint32_t reg = readl(base);
reg &= ~(0xFu << 4); /* clear the 4-bit field */
reg |= (val & 0xFu) << 4; /* insert masked value */
writel(reg, base);
I made two other corrections out loud. One, I mask val with & 0xFu before shifting so a too-large val can't spill into neighbouring fields, that's a real register-corruption bug. Two, I'd use 0xFu (unsigned) so the ~ and shifts aren't done on a signed int; ~(0xF << 4) on signed int is technically fine on common ABIs but I prefer unsigned literals in register code as a habit.
The pattern itself, read-modify-write a sub-field, is exactly what you do programming a NIC's control register or a descriptor field, so getting clear-then-set right and bounding the inserted value is the whole point. And if another context can touch this register concurrently, the read-modify-write needs locking or a hardware atomic, because writel of a stale-read value clobbers a concurrent update.
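A generic version of that clear-then-set pattern could look like the sketch below. `reg_field_set` and `reg_field_get` are hypothetical names, and the mask is assumed to be given in register position (for example 0xF0 for bits 7:4).

```c
#include <assert.h>
#include <stdint.h>

/* Replace the field selected by `mask` with `val`; `shift` is the bit
 * position of the field's least significant bit. */
static uint32_t reg_field_set(uint32_t reg, uint32_t mask, unsigned shift,
                              uint32_t val)
{
    assert(shift < 32);
    /* The mask must be non-zero and must not have bits below `shift`. */
    assert(mask != 0 && ((mask >> shift) << shift) == mask);
    /* Masking after the shift bounds `val` so it cannot spill into
     * neighbouring fields. */
    return (reg & ~mask) | ((val << shift) & mask);
}

static uint32_t reg_field_get(uint32_t reg, uint32_t mask, unsigned shift)
{
    assert(shift < 32);
    return (reg & mask) >> shift;
}
```

Usage would be `reg = reg_field_set(reg, 0xF0u, 4, val);` between the read and the write, with whatever lock protects the register held around the whole sequence.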
- Why mask `val` before shifting it into place?
- What if an ISR writes the same register between the read and the write?
- How would you write a generic 'set field' helper with a mask and shift?
Suppose I give you a `uint8_t *buf` and a length. Parse the Ethernet header and, if it's IPv4, return the IP protocol byte. Talk me through it.
I'd avoid casting the buffer to C structs because the buffer may be unaligned and network fields are big-endian. I'd do explicit length checks before every read.
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_HLEN 14
#define ETH_P_IP 0x0800

static uint16_t get_be16(const uint8_t *p)
{
    return ((uint16_t)p[0] << 8) | p[1];
}

bool eth_ipv4_proto(const uint8_t *buf, size_t len, uint8_t *proto_out)
{
    if (!buf || !proto_out || len < ETH_HLEN)
        return false;
    uint16_t ethertype = get_be16(buf + 12);
    size_t off = ETH_HLEN;
    if (ethertype != ETH_P_IP)
        return false;
    if (len < off + 20)
        return false;
    uint8_t ver_ihl = buf[off];
    uint8_t version = ver_ihl >> 4;
    uint8_t ihl = ver_ihl & 0x0f;
    if (version != 4 || ihl < 5)
        return false;
    size_t ip_hlen = (size_t)ihl * 4;
    if (len < off + ip_hlen)
        return false;
    *proto_out = buf[off + 9];
    return true;
}
Out loud, I'd say: first prove the Ethernet header exists, read EtherType as big-endian, check IPv4, prove the minimum IPv4 header exists, validate version and IHL, then only read the protocol field. In a production NIC path I'd also ask whether VLAN tags are in scope, because then the EtherType may be behind one or more 802.1Q headers. For a first screen, I'd mention that and keep the core parser correct.
- How would VLAN tags change this?
- Where is the IPv4 protocol field?
- Why is `ntohs(*(uint16_t *)(buf + 12))` risky?
Extend that packet parser to handle one VLAN tag. What changes?
I would treat VLAN as another bounded parse step. The first EtherType may be 0x8100 for 802.1Q or 0x88a8 for provider bridging. With one VLAN tag, there are four extra bytes: TCI then the encapsulated EtherType.
#define ETH_P_8021Q 0x8100
#define ETH_P_8021AD 0x88a8

bool eth_ipv4_proto_one_vlan(const uint8_t *buf, size_t len, uint8_t *proto_out)
{
    if (!buf || !proto_out || len < 14)
        return false;
    size_t off = 14;
    uint16_t ethertype = get_be16(buf + 12);
    if (ethertype == ETH_P_8021Q || ethertype == ETH_P_8021AD) {
        if (len < off + 4)
            return false;
        ethertype = get_be16(buf + off + 2);
        off += 4;
    }
    if (ethertype != ETH_P_IP || len < off + 20)
        return false;
    uint8_t ver_ihl = buf[off];
    uint8_t ihl = ver_ihl & 0x0f;
    if ((ver_ihl >> 4) != 4 || ihl < 5)
        return false;
    if (len < off + (size_t)ihl * 4)
        return false;
    *proto_out = buf[off + 9];
    return true;
}
The main point is that every time the offset moves, I re-check that the bytes exist before touching them. That is exactly the habit I developed in embedded work: assume malformed input exists, keep the state machine explicit, and do not let a fast path become a memory-safety bug.
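For stacked tags, the same habit turns into a bounded loop. This is a sketch under my own assumptions: the helper names are hypothetical, and the limit of two tags (`SKETCH_MAX_VLAN_TAGS`) is an illustrative policy choice, not something a standard mandates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_VLAN_TAGS 2u  /* assumed policy limit */

static uint16_t be16_at(const uint8_t *p)
{
    return (uint16_t)(((uint16_t)p[0] << 8) | p[1]);
}

/* On success, stores the first non-VLAN EtherType and the offset of the
 * header that follows it. */
static bool eth_skip_vlans(const uint8_t *buf, size_t len,
                           uint16_t *ethertype_out, size_t *l3_off_out)
{
    assert(ethertype_out != NULL && l3_off_out != NULL);
    if (!buf || len < 14)
        return false;

    size_t off = 14;
    uint16_t ethertype = be16_at(buf + 12);

    for (unsigned tags = 0; ethertype == 0x8100 || ethertype == 0x88a8; tags++) {
        /* off never exceeds len, so len - off cannot underflow. */
        if (tags == SKETCH_MAX_VLAN_TAGS || len - off < 4)
            return false;                 /* too many tags, or truncated tag */
        ethertype = be16_at(buf + off + 2);   /* skip the 2-byte TCI */
        off += 4;
    }
    *ethertype_out = ethertype;
    *l3_off_out = off;
    return true;
}
```

The explicit tag limit matters as much as the length check: it keeps a crafted frame from making the loop run for the whole packet.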
- How would you handle stacked VLAN tags?
- What is the difference between EtherType and length in Ethernet II versus older framing?
- Would you drop malformed packets or pass an error upward?
Write the IPv4 header checksum. How does one's-complement folding work, and how do you verify a header?
The IPv4 header checksum is the 16-bit one's complement of the one's-complement sum of the header treated as 16-bit big-endian words, with the checksum field itself taken as zero during computation. I accumulate into a 32-bit value so carries collect in the high half, then fold them back into the low 16 bits, then invert.
#include <stdint.h>
#include <stddef.h>

uint16_t ip_checksum(const uint8_t *hdr, size_t len)
{
    uint32_t sum = 0;
    size_t i;
    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)hdr[i] << 8) | hdr[i + 1];
    if (i < len) /* odd trailing byte, pad low */
        sum += (uint32_t)hdr[i] << 8;
    while (sum >> 16) /* fold carries into low 16 */
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}
Two things I'd say out loud. The folding loop, sum = (sum & 0xFFFF) + (sum >> 16), runs until the high half is zero, because the fold can itself carry. And the elegant part for verification: because a value plus its one's complement is all ones, if I run the same sum over the whole header including the existing checksum field, a valid header sums to 0xFFFF. So my validator is the identical loop, and I just check the folded result is 0xFFFF rather than inverting.
This is also why the accumulator is deliberately 32 bits: with a uint16_t accumulator each addition would be computed in int after promotion and then truncated on assignment, so the carries would be lost before I could fold them. The same one's-complement machinery underlies the TCP and UDP checksums, which additionally cover a pseudo-header and the payload, which is why NICs offload it.
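The verification side, sketched as a standalone validator. The names `ones_sum16` and `ipv4_header_checksum_ok` are mine; the length checks assume the caller passes IHL * 4 as `hdr_len`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One's-complement sum of big-endian 16-bit words, folded to 16 bits. */
static uint16_t ones_sum16(const uint8_t *p, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)p[i] << 8) | p[i + 1];
    if (i < len)                          /* odd trailing byte, pad low */
        sum += (uint32_t)p[i] << 8;
    while (sum >> 16)                     /* the fold can itself carry */
        sum = (sum & 0xFFFFu) + (sum >> 16);
    return (uint16_t)sum;
}

/* Sum the header including its stored checksum; a valid header folds to
 * 0xFFFF. An IPv4 header is 20..60 bytes and a multiple of 4. */
static bool ipv4_header_checksum_ok(const uint8_t *hdr, size_t hdr_len)
{
    assert(hdr != NULL);
    if (hdr_len < 20 || hdr_len > 60 || (hdr_len & 3u) != 0)
        return false;
    return ones_sum16(hdr, hdr_len) == 0xFFFFu;
}
```

A quick hand check I would do at the whiteboard: the 20-byte header 45 00 00 73 00 00 40 00 40 11 b8 61 c0 a8 00 01 c0 a8 00 c7 sums to 0x2FFFD including its checksum field, which folds to 0xFFFF.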
- Why accumulate in 32 bits instead of 16?
- How does the TCP checksum differ from this?
- Why does a valid header sum to 0xFFFF when you include the checksum field?
Type punning: I want to read a big-endian `uint32_t` out of a byte buffer. Is `*(uint32_t *)(buf + off)` okay?
No, I'd avoid that, for two independent reasons.
First, alignment: buf + off may not be 4-byte aligned, and an unaligned uint32_t load is undefined behaviour in C and faults or runs slowly on some architectures. Second, strict aliasing: accessing memory that isn't really a uint32_t object through a uint32_t * violates the aliasing rules, so the compiler is allowed to optimise on the assumption it never happens, which produces 'works at -O0, breaks at -O2' bugs. It also doesn't fix endianness, the bytes are still in memory order.
The robust way is to read bytes and assemble, which is alignment-safe, aliasing-safe, and makes byte order explicit:
static uint32_t load_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) |
           ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) |
           (uint32_t)p[3];
}
If I wanted a single wide load rather than byte assembly, the blessed escape hatch is memcpy into a uint32_t (it has no alignment or aliasing requirement on the source, and the compiler turns it into a single load when that is safe), or accessing through a character type, which is the one kind of pointer allowed to alias anything. Either way the value is still in memory order, so it needs an explicit ntohl/be32_to_cpu conversion. For parsing arbitrary packet bytes I default to the byte-assembly helper, where the conversion falls out for free because I built the value big-endian-first.
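The memcpy route, as a userspace sketch. It assumes a POSIX system for `ntohl` from `<arpa/inet.h>`; the function name `load_be32_memcpy` is mine.

```c
#include <arpa/inet.h>  /* ntohl (POSIX) */
#include <assert.h>
#include <stdint.h>
#include <string.h>

static uint32_t load_be32_memcpy(const uint8_t *p)
{
    uint32_t v;

    assert(p != NULL);
    memcpy(&v, p, sizeof v);  /* no alignment or aliasing requirement on p */
    return ntohl(v);          /* network (big-endian) order to host order */
}
```

The two steps are separate on purpose: memcpy solves alignment and aliasing, and ntohl solves byte order. Dropping either one reintroduces a bug that only shows up on some targets.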
- Why does `memcpy` avoid the aliasing problem?
- Which pointer type is allowed to alias any object?
- When is the cast actually safe?
At screen depth, explain how a NIC receive ring works.
A receive ring is a circular array of descriptors shared between the driver and the NIC. The driver allocates packet buffers, maps them for DMA, and posts descriptors saying, in effect, 'device may write a packet into this buffer'. The NIC DMAs received packets into those buffers and updates completion or ownership information. The driver polls or handles an interrupt, finds completed descriptors, unmaps or syncs the DMA memory as required, builds or updates the OS packet object, passes the packet upward, then replenishes the ring with fresh buffers.
The key details I would keep straight are:
- There is a producer/consumer relationship, but the producer depends on direction: driver produces free RX buffers; NIC produces completed packets.
- The ring index wraps, usually with power-of-two sizing and a mask.
- Ownership must be explicit so CPU and device do not touch the same descriptor at the wrong time.
- Memory ordering matters: descriptor contents must be visible before the doorbell or tail pointer is updated.
- DMA mapping is not just a pointer cast; the device sees DMA addresses, and the IOMMU/cache-coherency rules matter.
From my wireless/PHY background, the shape is familiar: bounded queues, fixed latency budgets, hardware/firmware/software ownership, and careful buffer lifetime. The Ethernet-specific details are new domain surface area, but the real-time HW/SW contract is the same kind of engineering.
- What happens if the RX ring runs out of buffers?
- Where do memory barriers fit?
- What is the difference between descriptors and packet buffers?
Can you sketch a simple single-producer/single-consumer ring buffer in C?
For a screen, I'd keep it single-producer/single-consumer and power-of-two sized. I'd also leave one slot empty so full and empty are easy to distinguish.
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 1024u
#define RING_MASK (RING_SIZE - 1u)

struct ring {
    void *slot[RING_SIZE];
    uint32_t head; /* consumer reads here */
    uint32_t tail; /* producer writes here */
};

bool ring_push(struct ring *r, void *p)
{
    uint32_t tail = r->tail;
    uint32_t next = (tail + 1u) & RING_MASK;
    if (next == r->head)
        return false;
    r->slot[tail] = p;
    r->tail = next;
    return true;
}

bool ring_pop(struct ring *r, void **p)
{
    uint32_t head = r->head;
    if (head == r->tail)
        return false;
    *p = r->slot[head];
    r->head = (head + 1u) & RING_MASK;
    return true;
}
Then I'd state the assumptions clearly: this is not a general multi-threaded queue. For true lock-free SPSC between cores, I'd need atomics or kernel primitives with acquire/release ordering so the consumer cannot see the updated tail before the slot contents are visible. For a NIC descriptor ring, the same idea appears with device-visible descriptors, DMA ordering, and doorbells.
- How do you distinguish full from empty?
- Why use a power-of-two ring?
- What changes for multiple producers?
Make that SPSC ring actually safe across two threads. What's the minimum you change?
The plain version has a real ordering bug on a weakly-ordered CPU: the consumer could observe the bumped tail before the slot write lands, and read a stale pointer. The minimum fix is to make head and tail atomics and use release on publish, acquire on observe, so the data write is ordered before the index becomes visible.
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct ring {
    void *slot[RING_SIZE];
    _Atomic uint32_t head;
    _Atomic uint32_t tail;
};

bool ring_push(struct ring *r, void *p)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t next = (tail + 1u) & RING_MASK;
    if (next == atomic_load_explicit(&r->head, memory_order_acquire))
        return false;
    r->slot[tail] = p;
    atomic_store_explicit(&r->tail, next, memory_order_release);
    return true;
}

bool ring_pop(struct ring *r, void **p)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    if (head == atomic_load_explicit(&r->tail, memory_order_acquire))
        return false;
    *p = r->slot[head];
    atomic_store_explicit(&r->head, (head + 1u) & RING_MASK, memory_order_release);
    return true;
}
The load of my own index can be relaxed since only I write it; the load of the other side's index is acquire, and my publish is release. I'd also mention the false-sharing angle: head and tail on the same cache line bounce between the two cores, so in production I'd pad them onto separate cache lines. The reason I care about this exercise is that it's the same publish-ordering as a NIC descriptor ring, where dma_wmb() plays the role of the release before the doorbell, and the owner-bit read is the acquire.
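The padding I have in mind could be sketched like this. The 64-byte line size is an assumption (typical for x86-64, but target-specific), and `struct ring_padded` is an illustrative name.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_CACHE_LINE 64    /* assumption: check the real target */
#define SKETCH_RING_SIZE  1024u

struct ring_padded {
    /* Each index gets its own cache line so the producer's stores to tail
     * do not invalidate the line holding the consumer's head, and vice versa. */
    _Alignas(SKETCH_CACHE_LINE) _Atomic uint32_t head;  /* written by consumer */
    _Alignas(SKETCH_CACHE_LINE) _Atomic uint32_t tail;  /* written by producer */
    _Alignas(SKETCH_CACHE_LINE) void *slot[SKETCH_RING_SIZE];
};

static_assert(offsetof(struct ring_padded, tail) -
              offsetof(struct ring_padded, head) >= SKETCH_CACHE_LINE,
              "head and tail must not share a cache line");
static_assert(offsetof(struct ring_padded, slot) -
              offsetof(struct ring_padded, tail) >= SKETCH_CACHE_LINE,
              "slots must not share a cache line with tail");
```

The cost is two mostly empty cache lines per ring, which is a trade I'd take on a hot path and question on a cold one.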
- Why can the load of your own index be relaxed?
- What is false sharing and how would you avoid it here?
- How does this map onto a NIC ring's doorbell and owner bit?
Walk me through a packet from the wire to a userspace application on Linux.
At a high level: the Ethernet signal is received by the PHY, decoded and passed through the MAC in the NIC. The NIC classifies or filters enough to decide what to do with it, then DMAs the packet into host memory using buffers the driver posted on the RX ring. It writes completion information saying a packet arrived, often including length, checksum status, RSS hash, timestamp, or offload metadata.
The driver is notified by interrupt or, under load, by NAPI polling. The driver consumes completed RX descriptors, synchronizes DMA memory as needed, wraps the buffer in an skb or equivalent path, and passes it into the Linux networking stack. The stack handles L2/L3/L4 processing: Ethernet, IP, TCP or UDP, routing/filtering, socket lookup, and eventually data becomes available to the userspace process via recv, read, epoll, or an async mechanism.
For a low-latency or high-throughput system, the interesting costs are copies, cache misses, interrupts, batching, queue selection, NUMA placement, and how early the packet can be steered to the right core. That is where my background maps well: in PHY/DSP work I am used to asking where the latency budget is spent, which boundary owns each buffer, and what must happen in hard real time versus what can be batched.
- Where does NAPI fit and why does it exist?
- What are RSS and queue steering for?
- Where can packet copies happen?
Explain kernel bypass simply. Why did Solarflare-style NIC teams care about it?
Kernel bypass means arranging for an application to send and receive packets without taking the normal Linux networking-stack path for every packet. Instead of every packet becoming an skb and going through generic kernel processing, the application talks more directly to NIC queues through a controlled userspace interface, with memory mapped rings and buffers.
The reason teams like Solarflare cared is latency and determinism. The normal kernel path is general-purpose and feature-rich, but it has overhead: interrupts, scheduler effects, socket stack processing, copies or cache disruption, locks, and unpredictable contention. If the workload is market data, storage, or AI-cluster transport where microseconds and tail latency matter, a more direct datapath can be valuable.
I would not describe bypass as 'the kernel is bad'. The kernel gives isolation, compatibility, security, and a huge protocol surface. Bypass is a tradeoff: you take more responsibility in the application and driver interface to get lower overhead and tighter control. That tradeoff is familiar to me from real-time embedded work: sometimes you use the general framework; sometimes the latency budget forces a specialized path.
- What do you lose when bypassing the kernel stack?
- How might memory protection still be maintained?
- How does this relate to DPDK or Onload?
You are stronger in wireless PHY and embedded systems than Ethernet drivers. Why should we believe you can ramp into this role?
I would frame this as an adjacent-domain move, not a career change. The core engineering problem I have worked on is moving bits fast and correctly at the hardware/software boundary under real-time constraints. In wireless PHY that meant timing, buffers, fixed-point/DSP behaviour, hardware-facing C, observability, and careful reasoning when a bug could be in software, firmware, hardware, or the interpretation of a spec.
The domain surface changes here: Ethernet, PCIe, Linux driver conventions, TCP/IP, and NIC datapaths. I am not pretending those are identical to a 3GPP stack. But the habits transfer strongly:
- I am comfortable reading specs and reducing them to testable state machines.
- I debug across layers rather than assuming the fault is in my current file.
- I care about latency, data movement, cache behaviour, and bounded queues.
- I am used to writing C where undefined behaviour or one bad ownership assumption can become a hardware-facing bug.
My ramp plan would be concrete: get productive in the driver build/test flow, learn the team's descriptor and queue model, trace RX/TX from interrupt or polling through the stack, and then take contained bugs where I can build confidence without pretending to know the whole stack on day one. The reason I am interested in this group is that the same low-level instincts now matter in AI infrastructure networking as much as they did in radio systems.
- What would you learn first in your first month?
- Tell me about a time you debugged across hardware and software.
- How much Linux kernel code have you written?
Tell me about a technical bug you found at the hardware/software boundary.
Example to personalize:
Situation: In a wireless PHY integration test, we saw intermittent packet or block failures under a specific timing configuration. The failure looked like a DSP algorithm issue at first, but it only appeared when the embedded control path changed configuration close to a processing boundary.
Task: My job was to isolate whether the fault was in the algorithm, the embedded C control code, the hardware register programming, or the test setup.
Action: I reduced the test to the smallest reproducible timing window, added timestamped instrumentation around the configuration write and processing interrupt, and compared expected versus actual register state. I also checked for signedness and scaling assumptions in the fixed-point path because those are common PHY failure modes. The evidence pointed to an ownership/timing issue: software was updating a control field while the hardware block could still sample a mixed old/new configuration.
Result: We changed the update sequence so the new configuration was staged, synchronized at a safe boundary, and verified with a readback or status transition before enabling the next processing interval. The intermittent failure disappeared in the stress test, and we added a regression case around that timing edge.
How I would connect it here: that is the same class of bug I would expect in NIC work: descriptor ownership, doorbell ordering, interrupt timing, or a register update that is correct in isolation but wrong relative to the device's state machine. My instinct is to make the boundary observable, reduce the timing window, and prove ownership before changing code.
- What instrumentation did you add?
- How did you prove the fix?
- What would you do if the issue only reproduced once a day?
Explain endianness as it applies to Ethernet/IP and a NIC driver.
Endianness is about byte order for multi-byte values. Network protocols conventionally put multi-byte fields on the wire in big-endian, often called network byte order. Many hosts this team works with are little-endian, so if I read two bytes from an Ethernet header or IP header, I must convert deliberately rather than treating the bytes as a host uint16_t.
For example, EtherType bytes 08 00 mean IPv4. On a little-endian host, blindly loading those two bytes as a uint16_t could produce 0x0008 depending on how it is read. The correct options are explicit byte parsing, ntohs-style conversion in userspace, or kernel helpers such as be16_to_cpu when working with __be16 values.
In a NIC driver there may be two endian domains to think about:
- Wire protocol fields, which follow protocol byte order.
- Device descriptors and registers, which follow the hardware specification and may be little-endian even though the packet bytes are network order.
So I would not apply one blanket rule. I would read the hardware spec, use typed endian helpers where available, and keep conversions at the boundary. That prevents the common bug where code works on one CPU architecture or one test packet and fails elsewhere.
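A sketch of keeping the two domains apart with explicit helpers. The names are mine, and the little-endian descriptor format is a hypothetical example; a real device's byte order comes from its specification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Wire domain: protocol fields are big-endian. */
static uint16_t wire_get_be16(const uint8_t *p)
{
    assert(p != NULL);
    return (uint16_t)(((uint16_t)p[0] << 8) | p[1]);
}

/* Device domain: this hypothetical descriptor format is little-endian. */
static void desc_put_le32(uint8_t *p, uint32_t v)
{
    assert(p != NULL);
    p[0] = (uint8_t)(v & 0xFFu);
    p[1] = (uint8_t)((v >> 8) & 0xFFu);
    p[2] = (uint8_t)((v >> 16) & 0xFFu);
    p[3] = (uint8_t)(v >> 24);
}

static uint32_t desc_get_le32(const uint8_t *p)
{
    assert(p != NULL);
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}
```

Because each helper names its byte order, the same source is correct on little-endian and big-endian hosts, and a reviewer can see at the call site which domain a value belongs to.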
- What does `__be16` communicate in Linux code?
- Why can unaligned loads be a problem here too?
- Where would you put endian conversions in a parser?
What's wrong with `memcpy(dst, src, n)` when the regions overlap, and when does that actually happen in packet code?
memcpy requires that the source and destination not overlap; since C99 its pointer parameters are even declared restrict. If they overlap, it's undefined behaviour: the implementation is free to copy forward, backward, in word-sized chunks, or vectorised, whichever is fastest, so an overlap can corrupt data in ways that depend on the libc version and the length. The correct tool when regions can overlap is memmove, which is defined to behave as if the source were first copied to a temporary, so it handles overlap in either direction.
Where this bites in packet code is anything that shifts bytes within the same buffer, for example inserting or stripping a header or a VLAN tag in place:
/* strip a 4-byte VLAN tag at offset 12, in place; caller has checked len >= 18 */
memmove(buf + 12, buf + 16, len - 16); /* regions overlap -> memmove */
If I wrote memcpy there it might look fine in a unit test with a short packet and then corrupt larger frames in production. So my rule is: distinct buffers, memcpy; same buffer with a shifting offset, memmove. And on a hot datapath I'd question whether I need the copy at all, since the whole point of a zero-copy design like ef_vi is to avoid moving payload bytes; often you can adjust an offset or a descriptor rather than memmove the data.
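Wrapped up with its length checks, the in-place strip might look like this sketch. `vlan_strip_inplace` is a hypothetical helper, and it handles only a single 802.1Q tag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Remove one 802.1Q tag in place. On success *len shrinks by 4. */
static bool vlan_strip_inplace(uint8_t *buf, size_t *len)
{
    assert(buf != NULL && len != NULL);
    if (*len < 18)                        /* 12 MAC + 4 tag + 2 inner EtherType */
        return false;
    if (buf[12] != 0x81 || buf[13] != 0x00)
        return false;                     /* not 802.1Q-tagged */
    /* Source and destination overlap whenever more than 4 bytes move,
     * so this must be memmove, not memcpy. */
    memmove(buf + 12, buf + 16, *len - 16);
    *len -= 4;
    return true;
}
```

The length is passed by pointer so the caller cannot forget that the frame got shorter, which is the other half of getting an in-place edit right.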
- Why is overlap undefined rather than just implementation-defined?
- What does `restrict` promise the compiler?
- How would a zero-copy datapath avoid the move entirely?
Quick one: what does `sizeof(arr) / sizeof(arr[0])` give inside a function that took `arr` as a parameter?
It gives the wrong answer, and this is a trap I'd flag immediately. A function parameter declared as int arr[] or int arr[10] is adjusted to a pointer, int *arr. So inside the function sizeof(arr) is the size of a pointer, 8 on a 64-bit machine, not the array's size. sizeof(arr) / sizeof(arr[0]) then computes 8 / 4 == 2, regardless of the real length, which is a classic source of buffer overruns.
void f(int arr[10]) {
    /* sizeof(arr) == 8 (pointer), NOT 40 */
}

int main(void) {
    int a[10];
    /* sizeof(a) == 40 here, so the idiom works HERE only */
}
The ARRAY_SIZE-style idiom only works in the scope where the real array is declared. Across a function boundary the size information is gone, so I always pass the length explicitly as a separate size_t parameter, exactly the (buf, len) pattern I'd use for every packet-parsing function. This is the same discipline that keeps the parsers safe: never infer a length you weren't handed.
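A sketch of the idiom used where it is valid, with the length then handed across the function boundary. `ARRAY_SIZE_SKETCH`, `sum_ints`, and `sum_table` are illustrative names; this plain macro does not reject pointers, so it relies on discipline at the call site.

```c
#include <assert.h>
#include <stddef.h>

/* Valid only where `a` is a real array, not a pointer parameter. */
#define ARRAY_SIZE_SKETCH(a) (sizeof(a) / sizeof((a)[0]))

/* The callee cannot recover the length, so the caller passes it. */
static long sum_ints(const int *v, size_t n)
{
    assert(v != NULL || n == 0);
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += v[i];
    return total;
}

static long sum_table(void)
{
    static const int table[] = { 1, 2, 3, 4, 5 };
    /* `table` is a real array in this scope, so the idiom is correct here. */
    return sum_ints(table, ARRAY_SIZE_SKETCH(table));
}
```

The Linux kernel's ARRAY_SIZE goes one step further and adds a compile-time check that fails when the argument is a pointer, which is the safer form when the toolchain supports it.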
- Why does the array decay to a pointer at all?
- How do you write a safe `ARRAY_SIZE` macro and where is it valid?
- How does this connect to your packet-parsing signatures?
Your SPSC ring uses head and tail indices. How do you distinguish empty from full, and what changes if the ring size is a power of two?
There are a few common designs, and I should name the tradeoff instead of pretending there is only one right answer.
The simplest design leaves one slot empty:
static int ring_empty(const struct ring *r) {
    return r->head == r->tail;
}

static int ring_full(const struct ring *r) {
    /* full when advancing the producer index (tail) would hit the consumer index (head) */
    return ((r->tail + 1) & (RING_SIZE - 1)) == r->head;
}
That works cleanly when RING_SIZE is a power of two, because wraparound can use & (RING_SIZE - 1) instead of % RING_SIZE. The cost is capacity: a ring with 1024 entries can only hold 1023 items.
Another design keeps a separate count, but then both producer and consumer may update that shared count, which makes the concurrency story worse. A third design uses monotonically increasing head and tail counters and masks only when indexing the array. Then occupancy is tail - head (produced minus consumed), assuming unsigned wraparound is handled intentionally and the counter width is large enough.
For an interview, I would start with the one-empty-slot SPSC ring because it is easy to reason about and test. Then I would discuss variants: full-capacity rings, sequence numbers per slot, multi-producer locks or atomics, and descriptor rings where hardware owns some entries. In NIC code the ring is not just a data structure; it is also an ownership contract between CPU and device.
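The free-running-counter variant, sketched single-threaded to show only the index arithmetic. It keeps the earlier convention (head is the consumer side, tail the producer side), so occupancy is tail minus head; the `fr_` names and the small ring size are mine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FR_RING_SIZE 8u                     /* must be a power of two */
#define FR_RING_MASK (FR_RING_SIZE - 1u)

static_assert((FR_RING_SIZE & FR_RING_MASK) == 0, "size must be a power of two");

struct fr_ring {
    void *slot[FR_RING_SIZE];
    uint32_t head;  /* total items consumed, never masked */
    uint32_t tail;  /* total items produced, never masked */
};

static uint32_t fr_count(const struct fr_ring *r)
{
    return r->tail - r->head;  /* unsigned wraparound keeps this correct */
}

static bool fr_push(struct fr_ring *r, void *p)
{
    if (fr_count(r) == FR_RING_SIZE)
        return false;                       /* full: every slot in use */
    r->slot[r->tail & FR_RING_MASK] = p;    /* mask only when indexing */
    r->tail++;
    return true;
}

static bool fr_pop(struct fr_ring *r, void **p)
{
    if (fr_count(r) == 0)
        return false;                       /* empty */
    *p = r->slot[r->head & FR_RING_MASK];
    r->head++;
    return true;
}
```

This version uses all FR_RING_SIZE slots, because full (count equals size) and empty (count equals zero) are different counter states. For cross-thread use the counters would still need the acquire/release treatment.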
- Why is `%` sometimes avoided in hot rings?
- How would you make it multi-producer?
- What does a sequence number per slot buy you?
In a NIC descriptor ring, what is the difference between a descriptor, a buffer, an index, and a doorbell?
I would separate the nouns clearly.
- The buffer is host memory that holds packet data or receives packet data.
- The descriptor is metadata that tells the NIC where the buffer is, how long it is, and what flags or offload instructions apply.
- The index or producer/consumer pointer says which descriptors are available, pending, or completed.
- The doorbell is usually an MMIO write that tells the device new descriptors are ready or that an index changed.
On TX, software fills the packet buffer, fills the descriptor with the DMA address and length, makes the descriptor visible to the device, then rings the doorbell. Later the device writes a completion or updates state, and software can reclaim the buffer.
On RX, software pre-posts empty buffers through descriptors. The NIC DMAs packet bytes into those buffers, writes completion metadata such as length or checksum status, and software consumes the completion before giving the packet to the stack or a bypass path.
The ordering matters. If I ring the doorbell before the descriptor is visible, the NIC may fetch stale or incomplete metadata. If I free or reuse a TX buffer before completion, the NIC may DMA from memory that now belongs to something else. I have not shipped this in a Linux NIC driver, but I have studied and traced the concept, and it maps to the ownership and timing problems I have debugged in MCU-DSP and TX DSP firmware work.
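The TX ordering rule (fill the descriptor, make it visible, then ring the doorbell) can be sketched in userspace C. Everything here is hypothetical: the `tx_desc` layout, the `flags` value, and the `doorbell` pointer, which is an ordinary variable standing in for an MMIO register. The C11 fence only marks where the barrier belongs; it does not order writes against a real device. In a Linux driver that spot would typically use a helper such as `dma_wmb()` and the doorbell write would go through `writel()`.

```c
#include <stdatomic.h>
#include <stdint.h>

#define TX_RING_SIZE 4u              /* hypothetical; power of two */

struct tx_desc {                     /* hypothetical layout, not a real NIC's */
    uint64_t dma_addr;               /* bus address of the packet buffer */
    uint16_t len;                    /* bytes to transmit */
    uint16_t flags;                  /* hypothetical: bit 0 = valid */
};

struct tx_queue {
    struct tx_desc ring[TX_RING_SIZE];
    uint32_t prod;                   /* descriptors handed to the device */
    uint32_t cons;                   /* descriptors reclaimed after completion */
    volatile uint32_t *doorbell;     /* stand-in for an MMIO register */
};

/* Post one buffer. Returns 0 on success, -1 if no descriptor is free. */
static int tx_post(struct tx_queue *q, uint64_t dma_addr, uint16_t len) {
    struct tx_desc *d;

    if ((uint32_t)(q->prod - q->cons) == TX_RING_SIZE)
        return -1;                   /* device still owns every descriptor */

    d = &q->ring[q->prod & (TX_RING_SIZE - 1u)];
    d->dma_addr = dma_addr;          /* 1. fill the descriptor */
    d->len = len;
    d->flags = 1u;

    /* 2. barrier: descriptor stores must complete before the doorbell */
    atomic_thread_fence(memory_order_release);

    q->prod++;
    *q->doorbell = q->prod;          /* 3. tell the device about the new index */
    return 0;
}

/* Reclaim one descriptor once the device has reported completion. */
static void tx_complete_one(struct tx_queue *q) {
    if (q->cons != q->prod)
        q->cons++;                   /* only now may the buffer be reused */
}
```

The ownership contract shows up in the two counters: `tx_post` refuses to touch a descriptor the device may still read, and only `tx_complete_one` hands one back to software.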
- Where would a DMA barrier fit?
- What happens if a TX completion is lost?
- Why are RX buffers posted before packets arrive?
Whiteboard this: parse Ethernet, optional VLAN, IPv4, and return the L4 protocol byte safely from `uint8_t *buf, size_t len`.
I would narrate the safety rules first: never cast the packet pointer to a struct, never read beyond len, parse multi-byte fields explicitly in network byte order, and keep an offset that advances only after length checks.
A compact interview version is:
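#include <stddef.h>
#include <stdint.h>
/* Read a big-endian (network order) 16-bit value one byte at a time. */
static int read_be16(const uint8_t *p) {
    return ((int)p[0] << 8) | p[1];
}
/* Return the IPv4 protocol byte (0..255), or -1 if the frame is not
 * well-formed IPv4 over Ethernet with at most one VLAN tag. */
int ipv4_proto(const uint8_t *buf, size_t len) {
    size_t off = 0;
    int ethertype;
    uint8_t ihl;
    if (len < 14)                        /* dst MAC + src MAC + EtherType */
        return -1;
    ethertype = read_be16(buf + 12);
    off = 14;
    if (ethertype == 0x8100) {           /* one 802.1Q tag: TCI + inner type */
        if (len < off + 4)
            return -1;
        ethertype = read_be16(buf + off + 2);
        off += 4;
    }
    if (ethertype != 0x0800)             /* not IPv4 */
        return -1;
    if (len < off + 20)                  /* minimum IPv4 header */
        return -1;
    if ((buf[off] >> 4) != 4)            /* version nibble must be 4 */
        return -1;
    ihl = buf[off] & 0x0f;               /* header length in 32-bit words */
    if (ihl < 5)
        return -1;
    if (len < off + (size_t)ihl * 4)     /* full header, including options */
        return -1;
    return buf[off + 9];                 /* protocol field */
}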
Then I would test aloud: too-short Ethernet frame, non-IPv4 EtherType, VLAN-tagged IPv4, IPv4 with invalid IHL, IPv4 with options, and a normal TCP packet returning 6. If they ask for production polish, I would add support for stacked VLAN tags only if required, use kernel endian helpers in kernel code, and return richer error codes if useful.
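If stacked VLAN tags are required, one possible shape is a bounded loop that walks past the tags before the IPv4 checks. This is a sketch under assumptions: the helper name `skip_vlan_tags` and the `MAX_VLAN_TAGS` limit of two are illustrative choices, and it accepts both the 802.1Q TPID (0x8100) and the 802.1ad service-tag TPID (0x88a8).

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_VLAN_TAGS 2   /* assumption: accept at most QinQ depth */

/* Walk past up to MAX_VLAN_TAGS VLAN tags.
 * On success, returns the final EtherType (0..0xffff) and stores the
 * offset of the L3 header in *l3_off. Returns -1 on a short frame or
 * when more tags are stacked than the limit allows. */
static int skip_vlan_tags(const uint8_t *buf, size_t len, size_t *l3_off) {
    size_t off = 14;
    int ethertype;
    int tags = 0;

    if (len < 14)
        return -1;
    ethertype = (buf[12] << 8) | buf[13];

    while (ethertype == 0x8100 || ethertype == 0x88a8) {
        if (tags++ == MAX_VLAN_TAGS)
            return -1;               /* too deep: reject instead of looping on */
        if (len < off + 4)           /* tag is 2 bytes TCI + 2 bytes inner type */
            return -1;
        ethertype = (buf[off + 2] << 8) | buf[off + 3];
        off += 4;
    }

    *l3_off = off;
    return ethertype;
}
```

The explicit depth limit keeps the loop bounded on hostile input, and every read is still preceded by a length check against `len`.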
- Why avoid casting to `struct ethhdr *`?
- How would you handle QinQ or multiple VLAN tags?
- Where are the likely off-by-one bugs?
At first-screen depth, explain RDMA, RoCEv2 and Ultra Ethernet, and why AI clusters care.
I would frame this carefully: I have studied these concepts and can reason about them, but I have not shipped an RDMA stack, a RoCEv2 fabric, or a Linux NIC driver. My deepest shipped protocol work is wireless PHY and 3GPP-to-C firmware, so I would keep the answer conceptual and accurate.
RDMA means remote direct memory access. The goal is that one machine can move data directly into or out of registered memory on another machine with very little CPU involvement on the data path. That matters for AI because GPU training moves huge tensors between ranks. If every transfer burns host CPU, copies through intermediate buffers, or waits in unpredictable queues, the GPUs stall.
RoCEv2 is RDMA carried over routable Ethernet/IP inside UDP (destination port 4791). The attraction is that it can use Ethernet infrastructure while giving RDMA-style low CPU overhead. The hard part is that RDMA transports cope badly with loss and tail latency. A tiny packet-loss rate or a pause storm can stall a synchronized collective because one delayed rank holds up the next phase. So RoCEv2 deployments care about a lossless or near-lossless traffic class, usually with ECN, PFC and NIC congestion control tuned as one system.
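The "one delayed rank holds up the next phase" point can be made concrete with a small calculation. This is a back-of-envelope sketch, assuming each rank's transfer is independently slow with the same probability; real fabrics have correlated delays, so treat it as intuition rather than a model. The function name `p_any_slow` is illustrative.

```c
/* Probability that at least one of n_ranks transfers is slow, assuming
 * each is independently slow with probability p_slow.
 * Computed as 1 - (1 - p_slow)^n_ranks with a plain loop. */
static double p_any_slow(double p_slow, unsigned n_ranks) {
    double p_all_fast = 1.0;

    for (unsigned i = 0; i < n_ranks; i++)
        p_all_fast *= 1.0 - p_slow;
    return 1.0 - p_all_fast;
}
```

With `p_slow = 0.01`, meaning each rank exceeds its own p99 latency 1% of the time, the result is roughly 0.47 for 64 ranks and above 0.99 for 512 ranks. Under that assumption a large synchronized step almost always waits on a tail event, which is why the tail matters more than the average.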
Ultra Ethernet is the industry effort, run by the Ultra Ethernet Consortium (UEC), to make Ethernet more directly suited to AI and HPC scale-out. Conceptually, it targets the operational pain points of large RoCE-style fabrics, such as incast and congestion hot spots. It does so by tolerating out-of-order delivery so traffic can be sprayed across paths, recovering from loss faster, and lowering tail latency under synchronized workloads. I would not claim to know AMD's internal implementation; I would say the public direction is clear: AI networking needs Ethernet that behaves predictably under huge many-to-many and many-to-one traffic, not just high average bandwidth.
The simple first-round summary is: RDMA reduces CPU and copy overhead, RoCEv2 brings RDMA semantics onto Ethernet, and Ultra Ethernet tries to make Ethernet transport behave better for AI-scale congestion and tail latency. The bridge to my CV is not that I have shipped this protocol stack. It is that I have implemented spec-driven low-level C in UL-DAI and TX DSP firmware, debugged MCU-DSP boundary issues, and triaged 100+ system problems where the real task was to connect a protocol requirement to hardware-visible timing, state and observability.
- Why does AI training care about p99 latency, not only average bandwidth?
- What does RoCEv2 gain and lose by using Ethernet/IP?
- Why is incast such a problem for synchronized collectives?
- How would you avoid overstating your experience here?