The questions they're likely to ask
Grouped by what each one probes. The trap or the thing the interviewer is listening for is called out next to each, because that's what actually moves the decision.
1 · Memory layout & alignment
Their bread and butter — descriptors and registers are structs over bytes.
What is sizeof this struct, and why? struct S { char a; int b; char c; };
On a typical system where int needs 4-byte alignment: a at offset 0, then 3 bytes of padding, b at 4–7, c at 8, then 3 bytes of trailing padding so the total is a multiple of the largest member's alignment (4). That's 12 bytes, not 6.
Reorder largest-to-smallest — struct { int b; char a; char c; } — and you get 8 bytes. #pragma pack(1) forces 6 bytes but b now sits at an unaligned offset: often slower, and unsafe if code later takes an unaligned pointer to it on a strict-alignment CPU. Only pack for on-the-wire / on-disk layouts, and prefer copying fields out with memcpy or explicit byte parsing.
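The layout arithmetic above can be checked at compile time. A minimal sketch, assuming int is 4 bytes with 4-byte alignment (true on mainstream ABIs, not guaranteed by ISO C); the struct names padded and reordered are illustrative:

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof */

struct padded    { char a; int b; char c; };   /* the struct from the question */
struct reordered { int b; char a; char c; };   /* largest member first */

/* These checks assume a 4-byte int with 4-byte alignment. */
static_assert(offsetof(struct padded, b) == 4, "3 bytes of padding after a");
static_assert(offsetof(struct padded, c) == 8, "c follows b directly");
static_assert(sizeof(struct padded) == 12, "3 bytes of trailing padding");
static_assert(sizeof(struct reordered) == 8, "only 2 bytes of tail padding");
```

If any assumption is wrong for the target, the build fails instead of the layout silently changing.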
#pragma pack.
How do you safely parse a packet header out of a byte buffer?
Do not cast a struct hdr * straight onto the uint8_t * buffer. Main hazards: the buffer may not be aligned for the struct (the access can fault or tear), the object may not have that struct's effective type (aliasing), and C struct padding won't match the wire layout. The safe pattern is memcpy into a properly-aligned struct, or parse field-by-field with explicit shifts, which also handles endianness for free. For example:
uint16_t ethertype;
memcpy(&ethertype, p + 12, sizeof ethertype);
ethertype = ntohs(ethertype);
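The field-by-field alternative might look like the sketch below. The 8-byte demo_hdr layout and the names rd_be16, rd_be32 and parse_demo_hdr are hypothetical, chosen only to illustrate the pattern:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read big-endian (network order) integers one byte at a time: no alignment
 * requirement, no aliasing problem, and host endianness never matters. */
uint16_t rd_be16(const uint8_t *p) {
    return (uint16_t)(((unsigned)p[0] << 8) | p[1]);
}

uint32_t rd_be32(const uint8_t *p) {
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Hypothetical wire header: bytes 0-1 type, 2-3 length, 4-7 sequence. */
struct demo_hdr { uint16_t type; uint16_t len; uint32_t seq; };

/* Returns 0 on success, -1 if the buffer is too short for a header. */
int parse_demo_hdr(const uint8_t *buf, size_t buflen, struct demo_hdr *out) {
    assert(buf != NULL && out != NULL);
    if (buflen < 8) return -1;
    out->type = rd_be16(buf);
    out->len  = rd_be16(buf + 2);
    out->seq  = rd_be32(buf + 4);
    return 0;
}
```

Checking the length before touching the buffer is part of the answer: a parser that trusts the wire is a bug even when the alignment is right.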
When and why would you use a flexible array member?
struct packet { uint16_t len; uint8_t data[]; }; — for a fixed header followed by a variable-length payload in a single allocation: malloc(sizeof(struct packet) + len). One alloc, one free, good cache locality. (Pre-C99 this was the data[1] “struct hack.”)
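A minimal sketch of the single-allocation pattern; packet_new is a hypothetical helper name, not a library function:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct packet {
    uint16_t len;
    uint8_t  data[];   /* flexible array member: must be the last member */
};

/* Allocate header and payload in one block and copy the payload in.
 * Returns NULL on allocation failure; release with a single free(). */
struct packet *packet_new(const uint8_t *payload, uint16_t len) {
    assert(payload != NULL || len == 0);
    struct packet *p = malloc(sizeof *p + len);
    if (p == NULL) return NULL;
    p->len = len;
    if (len > 0) memcpy(p->data, payload, len);
    return p;
}
```

Note that sizeof(struct packet) does not include the array, which is why the payload length is added explicitly.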
2 · Endianness
Networking — very likely to come up.
What do ntohs / htonl do, and when do you need them?
Network byte order is big-endian. POSIX htons/htonl convert host→network, and ntohs/ntohl network→host, for 16/32-bit integers (include <arpa/inet.h>; they're not ISO C). Use them for multi-byte numeric header fields: ports, lengths, IPv4 addresses. On a big-endian host they're no-ops; on little-endian (x86, most ARM) they byte-swap.
Swap a 16-bit value by hand: (uint16_t)(((uint16_t)x >> 8) | ((uint16_t)x << 8)) (cast first to dodge integer promotion). For 32-bit, use __builtin_bswap32 or shift a uint32_t with explicit masks.
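Written out as functions, the hand-rolled swaps could look like this; swap16 and swap32 are illustrative names:

```c
#include <stdint.h>

/* Swap the two bytes of a 16-bit value. x is promoted to int for the
 * shifts, so the result is cast back down to drop the high bits. */
uint16_t swap16(uint16_t x) {
    return (uint16_t)((x >> 8) | (x << 8));
}

/* Reverse the four bytes of a 32-bit value with explicit masks. */
uint32_t swap32(uint32_t x) {
    return ((x & UINT32_C(0x000000FF)) << 24) |
           ((x & UINT32_C(0x0000FF00)) << 8)  |
           ((x & UINT32_C(0x00FF0000)) >> 8)  |
           ((x & UINT32_C(0xFF000000)) >> 24);
}
```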
How do you detect endianness at runtime?
Write a known value and inspect its first byte through an unsigned char *, the portable way to read an object's representation:
uint32_t x = 0x01020304;
const unsigned char *p = (const unsigned char *)&x;
// little-endian if p[0] == 0x04
unsigned char * rather than union punning: both work in C, but this one nobody can argue with.
3 · volatile & memory-mapped I/O
Driver-specific — very high signal for this role.
What does volatile do, and when is it required?
It tells the compiler the object can change outside the program's visible control flow, so it must not cache the value in a register, elide a read/write, or reorder accesses relative to other volatile accesses. Commonly used for: memory-mapped hardware registers, a simple (atomically-accessed) variable touched by an ISR in embedded C, a volatile sig_atomic_t flag set by a signal handler, and setjmp/longjmp-visible locals.
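For the memory-mapped register case, a sketch of volatile accessors. The register bit definitions and helper names are hypothetical; on real hardware the pointers would come from the platform's mapping API, and many architectures also need barriers that volatile alone does not provide:

```c
#include <stdint.h>

/* Each call performs exactly one 32-bit access that the compiler may not
 * cache in a register, merge with a neighbour, or drop. */
uint32_t mmio_read32(const volatile uint32_t *reg) { return *reg; }
void mmio_write32(volatile uint32_t *reg, uint32_t v) { *reg = v; }

#define CTRL_ENABLE (UINT32_C(1) << 0)   /* hypothetical control bit */
#define STAT_READY  (UINT32_C(1) << 0)   /* hypothetical status bit */

/* Read-modify-write of a control register: one read, then one write. */
void ctrl_set_enable(volatile uint32_t *ctrl) {
    mmio_write32(ctrl, mmio_read32(ctrl) | CTRL_ENABLE);
}

/* Poll a status register at most max_polls times. Without volatile the
 * compiler could hoist the load out of the loop and spin forever.
 * Returns 0 once READY is seen, -1 on timeout. */
int wait_ready(const volatile uint32_t *status, unsigned max_polls) {
    while (max_polls--) {
        if (mmio_read32(status) & STAT_READY) return 0;
    }
    return -1;
}
```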
_Atomic / barriers. volatile is for the single-core MMIO/ISR case.
4 · Bit manipulation
Quick whiteboard — maps onto descriptor flags & register fields.
Set / clear / toggle / test bit n. And: is x a power of two?
x |= 1u << n; (set) · x &= ~(1u << n); (clear) · x ^= 1u << n; (toggle) · (x >> n) & 1u (test).
Power of two (for an unsigned x): x != 0 && (x & (x - 1)) == 0. The expression x & (x - 1) clears the lowest set bit; if the result is zero there was exactly one bit set. State it for unsigned so negatives / INT_MIN can't bite you.
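The same one-liners as small functions on a fixed-width type, a sketch with illustrative names. The asserts document the precondition that the bit index is below the type width:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

uint32_t bit_set(uint32_t x, unsigned n)    { assert(n < 32); return x |  (UINT32_C(1) << n); }
uint32_t bit_clear(uint32_t x, unsigned n)  { assert(n < 32); return x & ~(UINT32_C(1) << n); }
uint32_t bit_toggle(uint32_t x, unsigned n) { assert(n < 32); return x ^  (UINT32_C(1) << n); }
bool     bit_test(uint32_t x, unsigned n)   { assert(n < 32); return (x >> n) & 1u; }

/* True when exactly one bit is set; zero is not a power of two. */
bool is_pow2(uint32_t x) {
    return x != 0 && (x & (x - 1)) == 0;
}
```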
UINT32_C(1) << n with n < 32: shifting a signed int into the sign bit, or by ≥ the type width, is UB.
Count the set bits in a word.
Kernighan — loops once per set bit, not per bit:
for (c = 0; x; c++) x &= x - 1;
In production (GCC/Clang): __builtin_popcount, which compiles to a single POPCNT instruction where the target supports it.
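As a complete function, Kernighan's loop might be written like this; popcount32 is an illustrative name:

```c
#include <stdint.h>

/* Kernighan's method: x &= x - 1 clears the lowest set bit, so the loop
 * body runs once per set bit rather than once per bit position. */
unsigned popcount32(uint32_t x) {
    unsigned count = 0;
    while (x != 0) {
        x &= x - 1;
        count++;
    }
    return count;
}
```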
5 · Pointers & declarations
Dispatch tables and intrusive lists are everywhere in drivers.
Read these declarations: void (*fp)(int) vs void *fp(int). And int *p[10] vs int (*p)[10].
void (*fp)(int) — a pointer to a function taking int, returning void. void *fp(int) — a function taking int, returning void *. The parentheses bind the * to the name.
int *p[10] — an array of 10 pointers to int. int (*p)[10] — a pointer to an array of 10 ints. Read inside-out, right-to-left.
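A short sketch that puts each declarator shape to work. The dispatch table and all function names are hypothetical:

```c
#include <assert.h>
#include <stddef.h>

typedef int (*op_fn)(int);          /* pointer to function taking int, returning int */

int op_inc(int x) { return x + 1; }
int op_dbl(int x) { return x * 2; }

/* An array of function pointers: a tiny dispatch table. */
op_fn op_table[] = { op_inc, op_dbl };
#define NUM_OPS (sizeof op_table / sizeof op_table[0])

int dispatch(size_t opcode, int arg) {
    assert(opcode < NUM_OPS);
    return op_table[opcode](arg);
}

/* int (*rows)[4]: pointer to an array of 4 ints, so rows[r] is a whole row. */
int sum_rows(int (*rows)[4], size_t nrows) {
    int sum = 0;
    for (size_t r = 0; r < nrows; r++)
        for (size_t c = 0; c < 4; c++)
            sum += rows[r][c];
    return sum;
}

/* int *ptrs[]: array of pointers to int, each element dereferenced separately. */
int sum_ptrs(int *ptrs[], size_t n) {
    int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += *ptrs[i];
    return sum;
}
```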
Delete a node from a singly linked list without special-casing the head.
Walk a pointer-to-pointer — it points at the link you might have to rewrite, so the head is just another link:
// Remove every node with a given value — no head special case.
void remove_val(node **pp, int v) {
while (*pp) {
node *e = *pp;
if (e->val == v) { *pp = e->next; free(e); }
else { pp = &e->next; }
}
}
node ** idiom. It's the elegant answer they hope to see; the clumsy version tracks a separate prev and branches on the head.
6 · Undefined behavior & gotchas
Separates experienced from not. Usually “what's wrong with this code?”
Name the classic 'what's the bug?' UB cases.
- Returning a pointer to a stack local — dangling on return.
- Signed integer overflow is UB — use unsigned, or check before adding.
- Reading an uninitialized variable.
- sizeof(arr) on an array parameter: it decayed to a pointer, so you get the pointer size, not the array size.
- Off-by-one: writing N+1 bytes into an N-byte buffer.
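Two of the cases above have standard defensive patterns, sketched here with illustrative names (add_checked, sum_ints, ARRAY_LEN):

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Signed overflow is UB, so test the operands before adding.
 * Returns false, leaving *out untouched, if a + b would overflow. */
bool add_checked(int a, int b, int *out) {
    if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b))
        return false;
    *out = a + b;
    return true;
}

/* Element count of a true array. On a pointer (including an array
 * parameter) this silently gives the wrong answer. */
#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

/* Inside this function arr is a pointer, so the caller passes the count. */
long sum_ints(const int *arr, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += arr[i];
    return total;
}
```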
Operator precedence: *p++ vs (*p)++, and what does a & b == c parse as?
*p++ is *(p++) — dereference, then advance the pointer. (*p)++ increments the pointed-to value.
a & b == c parses as a & (b == c) because == binds tighter than bitwise &. A classic bug — always parenthesize bit ops.
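Both precedence points can be made concrete. In this sketch the function names are illustrative, and the parenthesization the compiler applies is written out explicitly:

```c
/* What the compiler sees for a & b == c. */
int parsed_as(unsigned a, unsigned b, unsigned c) {
    return (int)(a & (unsigned)(b == c));
}

/* What the author almost certainly meant. */
int intended(unsigned a, unsigned b, unsigned c) {
    return (a & b) == c;
}

/* *p++ : yields the value at the old position, then the pointer advances. */
int take_and_advance(const int **pp) {
    return *(*pp)++;
}

/* (*p)++ : the pointer stays put; the pointed-to value is incremented. */
int bump_value(int *p) {
    return (*p)++;
}
```

With a = 6, b = 3, c = 2 the intended form is true (6 & 3 is 2), while the parsed form is 6 & 0, which is false.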
7 · Concurrency & atomics
Datapath relevance — this is a low-latency multi-core team.
Why does a shared counter need a lock or an atomic?
count++ is a read-modify-write, not a single atomic step. Two threads can both read the old value, both add one, and both store, so one increment is lost. Fix with a mutex or atomic_fetch_add. Bonus insight: an atomic RMW is still costly (on x86 a LOCK-prefixed instruction that needs exclusive ownership of the cache line), which is why the SPSC ring avoids it entirely: give each index a single writer and you never need an RMW on the hot path.
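A minimal sketch of the atomic fix using C11 <stdatomic.h>. The counter and function names are hypothetical, and relaxed ordering is assumed to be enough because the counter is a statistic that does not publish other data:

```c
#include <stdatomic.h>

/* Shared counter; static storage duration, so it starts at zero. */
atomic_uint packets_seen;

/* Safe to call from any thread: the read-modify-write is one atomic step. */
void count_packet(void) {
    atomic_fetch_add_explicit(&packets_seen, 1u, memory_order_relaxed);
}

unsigned packets_total(void) {
    return atomic_load_explicit(&packets_seen, memory_order_relaxed);
}
```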
What is false sharing, and why does this team care?
Two independent variables that happen to land on the same 64-byte cache line. When two cores each write their own variable, the cache-coherence protocol bounces the whole line between them, serializing work that looks parallel. Fix: pad / align to a cache line (alignas(64)).
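One way to apply the padding fix to per-core statistics, as a sketch. The 64-byte line size is an assumption about the target, and core_stats, NUM_CORES and count_rx are hypothetical names:

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumption: 64-byte cache lines */
#define NUM_CORES  4    /* hypothetical core count */

/* alignas on the member raises the struct's alignment and size to a full
 * line, so adjacent array elements can never share one. */
struct core_stats {
    alignas(CACHE_LINE) uint64_t packets;
};

struct core_stats stats[NUM_CORES];

static_assert(sizeof(struct core_stats) == CACHE_LINE,
              "one counter per cache line");

/* Each core writes only its own element. */
void count_rx(unsigned core) {
    assert(core < NUM_CORES);
    stats[core].packets++;
}
```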
head and tail on separate cache lines.
Explain memory_order_relaxed / acquire / release / seq_cst.
relaxed — atomicity only, no ordering with other memory ops (fine for a counter only one thread writes). acquire (on a load) — nothing after it can be reordered before it; you see everything the releasing thread wrote before its release. release (on a store) — nothing before it can be reordered after it; it publishes prior writes. seq_cst — the default: a single global total order, correct but the most expensive.
The SPSC queue uses release to publish the index and acquire to observe it — a release/acquire pair is what creates the happens-before edge that makes the payload visible.
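The release/acquire pairing in its smallest form is a one-shot mailbox. This is a sketch with hypothetical names, assuming one thread calls mailbox_publish and another calls mailbox_try_read:

```c
#include <stdatomic.h>
#include <stdbool.h>

struct mailbox {
    int         payload;   /* plain data, written before the flag */
    atomic_bool ready;     /* publication flag */
};

/* Producer: the release store keeps the payload write from moving after it. */
void mailbox_publish(struct mailbox *m, int value) {
    m->payload = value;
    atomic_store_explicit(&m->ready, true, memory_order_release);
}

/* Consumer: if the acquire load sees true, the payload write is visible.
 * Returns false while nothing has been published. */
bool mailbox_try_read(struct mailbox *m, int *out) {
    if (!atomic_load_explicit(&m->ready, memory_order_acquire))
        return false;
    *out = m->payload;
    return true;
}
```

With relaxed on both sides the flag itself would still be atomic, but nothing would order the payload write relative to it.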
8 · OS & memory fundamentals
Reported AMD screens: virtual memory, allocators, process vs thread.
Virtual vs physical addresses, page tables, and the TLB?
Each process sees a private virtual address space; the MMU translates virtual → physical using page tables (multi-level, walked on a miss). The TLB caches recent translations so the walk is skipped on a hit. Relevance here: a NIC does DMA to physical addresses (or through an IOMMU), so a driver must pin/map pages and hand the device the right DMA address — CPU virtual pointers are meaningless to the hardware. Mention DMA coherency too: on a non-coherent system the driver needs cache clean/invalidate around DMA, or a coherent (uncached) mapping.
Implement a simple memory allocator with free-block coalescing.
Keep a free list of blocks, each with a header (size + free flag, often a footer too for backward merging). alloc finds a fit (first/best-fit), splits if the remainder is large enough. free marks the block free and coalesces with the physically-adjacent previous/next block if they're also free — that's what fights fragmentation. Mention alignment of returned pointers and the size/flag packed in the header's low bits.
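A toy version of that design over a static arena, as a sketch rather than a production allocator. Assumptions: first-fit, an implicit block list, the used flag packed into the header's low bit, and coalescing by a linear pass instead of footers. All names (unit_t, toy_alloc, toy_free) are hypothetical:

```c
#include <stddef.h>

/* The arena is carved into units; the union forces every unit, and so
 * every returned payload, to the platform's maximum alignment. */
typedef union unit {
    size_t      hdr;     /* (block length in units, header included) << 1 | used bit */
    max_align_t align;
} unit_t;

#define ARENA_UNITS 256
static unit_t arena[ARENA_UNITS];
static int arena_ready;

static size_t blk_units(size_t i) { return arena[i].hdr >> 1; }
static int    blk_used(size_t i)  { return (int)(arena[i].hdr & 1u); }
static void   blk_set(size_t i, size_t units, int used) {
    arena[i].hdr = (units << 1) | (size_t)used;
}

void *toy_alloc(size_t nbytes) {
    if (!arena_ready) { blk_set(0, ARENA_UNITS, 0); arena_ready = 1; }
    if (nbytes == 0 || nbytes > (ARENA_UNITS - 1) * sizeof(unit_t)) return NULL;
    /* One header unit plus the payload rounded up to whole units. */
    size_t need = 1 + (nbytes + sizeof(unit_t) - 1) / sizeof(unit_t);
    for (size_t i = 0; i < ARENA_UNITS; i += blk_units(i)) {   /* first fit */
        size_t have = blk_units(i);
        if (blk_used(i) || have < need) continue;
        if (have - need >= 2) {          /* split only if the rest can hold a payload */
            blk_set(i + need, have - need, 0);
            have = need;
        }
        blk_set(i, have, 1);
        return &arena[i + 1];
    }
    return NULL;
}

void toy_free(void *p) {
    if (p == NULL) return;
    size_t i = (size_t)((unit_t *)p - arena) - 1;   /* step back to the header */
    blk_set(i, blk_units(i), 0);
    /* Coalesce: merge every run of physically adjacent free blocks. */
    for (size_t cur = 0; cur < ARENA_UNITS; ) {
        size_t next = cur + blk_units(cur);
        if (!blk_used(cur) && next < ARENA_UNITS && !blk_used(next))
            blk_set(cur, blk_units(cur) + blk_units(next), 0);   /* absorb, stay on cur */
        else
            cur = next;
    }
}
```

The linear coalescing pass keeps the sketch short; adding a footer to each block is what lets a real allocator merge with the previous block in constant time.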
What's the difference between a process and a thread?
A process has its own virtual address space; a thread is a schedulable execution context that shares its process's address space with sibling threads. Threads share code/heap/globals and have their own stack and registers: cheaper to create and switch, but they need synchronization because they share memory. (Reported verbatim at IMC; standard at AMD too.)
9 · “Implement this in C”
A 15–20 min live exercise. The ring buffer is the most likely.
Implement a single-producer / single-consumer ring buffer.
This is the headline exercise — it has its own walkthrough on the NIC datapath page. The four points to say out loud: power-of-two capacity → mask not modulo, free-running indices make full vs empty unambiguous, release/acquire publishes the payload before the index, and cache-line padding avoids false sharing.
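The four points can be seen together in a compact sketch. This is one possible shape under stated assumptions (64-byte cache lines, a fixed capacity of 8, uint32_t payloads, exactly one producer thread and one consumer thread), not the walkthrough's implementation; all names are hypothetical:

```c
#include <stdalign.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_CAP  8u                 /* must be a power of two */
#define RING_MASK (RING_CAP - 1u)

struct spsc_ring {
    alignas(64) atomic_uint head;    /* free-running write index; producer-owned */
    alignas(64) atomic_uint tail;    /* free-running read index; consumer-owned */
    alignas(64) uint32_t slots[RING_CAP];
};

/* Producer side. Returns false when the ring is full. */
bool ring_push(struct spsc_ring *r, uint32_t v) {
    unsigned head = atomic_load_explicit(&r->head, memory_order_relaxed);
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_CAP) return false;    /* full: indices differ by capacity */
    r->slots[head & RING_MASK] = v;               /* write the payload first */
    atomic_store_explicit(&r->head, head + 1u, memory_order_release);  /* then publish */
    return true;
}

/* Consumer side. Returns false when the ring is empty. */
bool ring_pop(struct spsc_ring *r, uint32_t *out) {
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    unsigned head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail) return false;               /* empty: indices equal */
    *out = r->slots[tail & RING_MASK];            /* read the payload first */
    atomic_store_explicit(&r->tail, tail + 1u, memory_order_release);  /* then free the slot */
    return true;
}
```

Each side loads its own index relaxed because nobody else writes it; only the other side's index needs acquire.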
volatile wouldn't cut it.
Implement strlen (and how does the real one go faster?).
size_t my_strlen(const char *s) {
const char *p = s;
while (*p) p++;
return (size_t)(p - s);
}
// Real libc reads a word at a time and tests for a zero byte with a
// bit trick: (w - 0x0101...) & ~w & 0x8080... — worth mentioning.
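The bit trick from the comment, written out for a 32-bit word. This sketch shows only the zero-byte test, not a full word-at-a-time strlen, which also has to deal with alignment and reading past the terminator; has_zero_byte is an illustrative name:

```c
#include <stdint.h>

/* Nonzero if and only if at least one of the four bytes of w is 0x00.
 * Subtracting 1 from each byte borrows into bit 7 only for a zero byte
 * (or a byte that already had bit 7 set, which ~w masks out). */
uint32_t has_zero_byte(uint32_t w) {
    return (w - UINT32_C(0x01010101)) & ~w & UINT32_C(0x80808080);
}
```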
Implement memcpy. What's the difference from memmove?
A byte loop is the baseline; the real one copies a word at a time once pointers are aligned. memcpy's src/dst are restrict, so they must not overlap. If they can overlap, that's memmove, which picks the copy direction: backwards when dst sits above src (so the tail of src is read before it is overwritten), forwards otherwise.
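A byte-loop sketch of the overlap-safe version; my_memmove is an illustrative name. The addresses are compared as uintptr_t because relational comparison of pointers into different objects is not defined in C:

```c
#include <stddef.h>
#include <stdint.h>

void *my_memmove(void *dst, const void *src, size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    if ((uintptr_t)d < (uintptr_t)s) {
        /* dst below src: copy forwards so each source byte is read before
         * the write cursor reaches it. */
        for (size_t i = 0; i < n; i++) d[i] = s[i];
    } else if ((uintptr_t)d > (uintptr_t)s) {
        /* dst above src: copy backwards for the same reason. */
        for (size_t i = n; i > 0; i--) d[i - 1] = s[i - 1];
    }
    return dst;   /* dst == src: nothing to do */
}
```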
Reverse a singly linked list.
The bread-and-butter pointer exercise (reported at Xilinx). Three pointers, one pass, no extra memory — flip each next as you go:
// Reverse a singly linked list in place — three pointers, O(1) space.
node *reverse(node *head) {
node *prev = NULL;
while (head) {
node *next = head->next; // save
head->next = prev; // flip the link
prev = head; // advance prev
head = next; // advance head
}
return prev; // new head
}
How would you build a flow hash table for packet lookup?
Hash the 5-tuple (src/dst IP, src/dst port, protocol). For a datapath, open addressing with linear probing is cache-friendly and avoids a malloc per insert; chaining is simpler but pointer-chases. Size to a power of two, keep load factor moderate, and have a story for collisions and deletion (tombstones).
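A small open-addressing sketch along those lines. Assumptions: a fixed 64-slot table, FNV-1a-style mixing as a stand-in hash (a real datapath might use a hardware-provided or stronger hash), and tombstones for deletion. Every name here is hypothetical:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 64u                      /* power of two: mask instead of modulo */
#define TABLE_MASK (TABLE_SIZE - 1u)

struct flow_key {                           /* the 5-tuple */
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
};

enum { SLOT_EMPTY = 0, SLOT_USED, SLOT_TOMBSTONE };

struct flow_slot  { struct flow_key key; uint32_t value; uint8_t state; };
struct flow_table { struct flow_slot slots[TABLE_SIZE]; };   /* zeroed = all empty */

/* Compare field by field: memcmp would also compare padding bytes. */
static bool key_eq(const struct flow_key *a, const struct flow_key *b) {
    return a->src_ip == b->src_ip && a->dst_ip == b->dst_ip &&
           a->src_port == b->src_port && a->dst_port == b->dst_port &&
           a->proto == b->proto;
}

static uint32_t mix32(uint32_t h, uint32_t v) {
    for (int i = 0; i < 4; i++) { h ^= (v >> (8 * i)) & 0xFFu; h *= 16777619u; }
    return h;
}

static uint32_t flow_hash(const struct flow_key *k) {
    uint32_t h = 2166136261u;
    h = mix32(h, k->src_ip);
    h = mix32(h, k->dst_ip);
    h = mix32(h, ((uint32_t)k->src_port << 16) | k->dst_port);
    return mix32(h, k->proto);
}

/* Linear probe; an EMPTY slot ends the chain, a TOMBSTONE does not. */
struct flow_slot *flow_find(struct flow_table *t, const struct flow_key *k) {
    uint32_t i = flow_hash(k) & TABLE_MASK;
    for (uint32_t n = 0; n < TABLE_SIZE; n++, i = (i + 1) & TABLE_MASK) {
        struct flow_slot *s = &t->slots[i];
        if (s->state == SLOT_EMPTY) return NULL;
        if (s->state == SLOT_USED && key_eq(&s->key, k)) return s;
    }
    return NULL;
}

/* Insert or update. Returns false only when no slot is available. */
bool flow_insert(struct flow_table *t, const struct flow_key *k, uint32_t value) {
    struct flow_slot *s = flow_find(t, k);
    if (s == NULL) {
        uint32_t i = flow_hash(k) & TABLE_MASK;
        for (uint32_t n = 0; n < TABLE_SIZE; n++, i = (i + 1) & TABLE_MASK) {
            if (t->slots[i].state != SLOT_USED) { s = &t->slots[i]; break; }
        }
        if (s == NULL) return false;
        s->key = *k;
        s->state = SLOT_USED;
    }
    s->value = value;
    return true;
}

/* Mark the slot as a tombstone so later keys in the chain stay reachable. */
bool flow_remove(struct flow_table *t, const struct flow_key *k) {
    struct flow_slot *s = flow_find(t, k);
    if (s == NULL) return false;
    s->state = SLOT_TOMBSTONE;
    return true;
}
```

Tombstones accumulate under churn, so a long-running table also needs a story for rebuilding or for backward-shift deletion.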
10 · Rapid-fire
Short, high-frequency. Have one-liners ready.
malloc/free: leak, double-free, use-after-free, stack vs heap?
Leak — you drop the last pointer to a block without freeing it. Double-free — freeing the same block twice corrupts the allocator. Use-after-free — dereferencing a freed pointer (set it to NULL after free). Stack = automatic storage, freed on scope exit; heap = manual lifetime via malloc/free.
What are the two meanings of static?
At file scope: internal linkage — the symbol is not visible to other translation units. On a local variable: static storage duration — it persists across calls and is initialized once.
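Both meanings in one short sketch; the function names are illustrative:

```c
/* File scope: internal linkage. Another translation unit can define its
 * own clamp_to_byte without a link-time clash. */
static int clamp_to_byte(int v) {
    if (v < 0) return 0;
    if (v > 255) return 255;
    return v;
}

int to_byte(int v) { return clamp_to_byte(v); }

/* Block scope: static storage duration. last is zero-initialized once,
 * before the program starts, and keeps its value between calls. */
unsigned next_id(void) {
    static unsigned last;
    return ++last;
}
```

The static local is shared state, so next_id as written is not thread-safe.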
What's wrong with #define SQ(x) x*x ?
SQ(a + b) expands to a + b * a + b — wrong. Parenthesize everything: #define SQ(x) ((x) * (x)). Also beware double evaluation of side effects (SQ(i++)), and wrap multi-statement macros in do { ... } while (0).
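The broken and fixed forms side by side, as a sketch. SQ_BAD, sq_int and SWAP_INT are illustrative names:

```c
#define SQ_BAD(x) x * x            /* the macro from the question */
#define SQ(x)     ((x) * (x))      /* parenthesized, but still evaluates x twice */

/* A function evaluates its argument exactly once and is type-checked. */
static inline int sq_int(int x) { return x * x; }

/* Multi-statement macro wrapped in do { } while (0) so it behaves as a
 * single statement, including inside an unbraced if/else. */
#define SWAP_INT(a, b) do { int tmp_ = (a); (a) = (b); (b) = tmp_; } while (0)
```

SQ_BAD(1 + 2) expands to 1 + 2 * 1 + 2, which is 5; SQ(1 + 2) gives 9.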
const char *p vs char * const p ?
const char *p — pointer to const char: you can't write *p, but you can repoint p. char * const p — const pointer: you can write *p, but can't repoint p. Read right-to-left.