
🧬 Advanced C and Undefined Behavior

Probes whether a senior low-level engineer can reason about the C abstract machine, optimizer assumptions, ABI details, and real driver failure modes.

A NIC fast path casts a DMA completion byte buffer to `struct rx_cqe *` and reads fields directly. What strict-aliasing, effective-type, alignment, and endian issues would you review before accepting it? (staff)

I would separate object representation from C object type. Bytes from a device or allocator can always be read through unsigned char * (character types may alias anything), but reading them through struct rx_cqe * is only well defined if that storage actually has rx_cqe as its effective type, or the implementation contract explicitly supports the mapping. Malloc'd or DMA storage has no declared type; its effective type is set by the last *store* through a non-character lvalue (or by a memcpy from a typed object). A buffer declared as an unsigned char array, by contrast, keeps its declared type no matter what the device wrote into it. Reading through struct rx_cqe * storage whose effective type is something else is the part the optimizer can punish.

The checklist is:

Alignment: struct rx_cqe * may require 4-, 8-, or larger alignment. Forming a misaligned pointer is itself UB before any access; some CPUs (older Arm, some accelerators) trap, others split the load and hurt latency.
Packing: __attribute__((packed)) removes padding assumptions but tells the compiler the struct is under-aligned, so it emits byte-wise or unaligned loads. It is a correctness tool, not a performance answer.
Effective type and aliasing: under -O2 -fstrict-aliasing the compiler may assume a struct rx_cqe * does not alias unrelated typed objects, and may reorder or fold loads accordingly.
Endianness: descriptor fields are device byte order; decode with le16toh/le32toh, never assume native.
Volatility/coherency: DMA memory is not MMIO. volatile is not a cache-coherency primitive; use the platform DMA API (dma_rmb() on read, ownership flags) so you observe the descriptor only after the device released it.

The robust pattern is to copy bytes into a properly aligned object:

struct rx_cqe cqe;
memcpy(&cqe, bytes, sizeof cqe);     /* bytes may be unaligned; cqe is not */
uint16_t len = le16toh(cqe.len);     /* device byte order to host order */

For small fixed-size copies, optimizing compilers turn memcpy into scalar loads while preserving the language-level aliasing story.
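
The copy-then-decode pattern can be sketched end to end as follows. The `struct rx_cqe` layout (fields `len`, `flags`, `hash`) and the helper names are hypothetical and not taken from any real NIC; the byte-wise decoders stand in for `le16toh`/`le32toh` so that the sketch needs no platform header.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 8-byte completion entry; fields arrive in little-endian order. */
struct rx_cqe {
    uint16_t len;
    uint16_t flags;
    uint32_t hash;
};
_Static_assert(sizeof(struct rx_cqe) == 8, "rx_cqe wire size");

/* Portable stand-ins for le16toh/le32toh: decode the stored bytes. */
static uint16_t le16_decode(uint16_t wire)
{
    unsigned char b[2];
    memcpy(b, &wire, sizeof b);
    return (uint16_t)(b[0] | (b[1] << 8));
}

static uint32_t le32_decode(uint32_t wire)
{
    unsigned char b[4];
    memcpy(b, &wire, sizeof b);
    return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
           ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}

/* Copy from an arbitrarily aligned byte buffer, then convert to host order. */
static struct rx_cqe rx_cqe_decode(const unsigned char *bytes)
{
    struct rx_cqe cqe;
    memcpy(&cqe, bytes, sizeof cqe);   /* no cast, no alignment assumption */
    cqe.len   = le16_decode(cqe.len);
    cqe.flags = le16_decode(cqe.flags);
    cqe.hash  = le32_decode(cqe.hash);
    return cqe;
}
```

Because the source is only ever touched as bytes, the same function works for a descriptor at an odd offset and on either host endianness.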

What they're listening for: Strong answers distinguish hardware layout concerns from C's effective-type rules and do not wave `volatile` or packed structs at the problem. The trap is assuming that because the bytes came from hardware, the compiler must treat any cast as legitimate. Bonus: noting that forming the misaligned pointer is UB independent of dereferencing it.

Follow-ups

When would you deliberately compile a driver component with `-fno-strict-aliasing`?
How would you detect an alignment-sensitive bug that reproduces only on Arm?
Why is `memcpy` often optimized away but still changes the definedness of the access?

What does `restrict` let the optimizer assume, and how can a wrong `restrict` annotation corrupt a packet-processing loop? (senior)

restrict is a promise that holds for the lifetime of the restricted pointer: if an object accessed through that pointer is modified by any means, then every access to it in that scope goes *only* through the same pointer or values derived from it, never through another independent pointer. Objects that are only read may still be reached through several pointers. The promise lets the compiler keep values in registers across stores, reorder loads and stores, vectorize, and skip reloads after a write through a different pointer.

This is only correct if dst and src do not overlap:

void copy_words(uint32_t * restrict dst,
                const uint32_t * restrict src,
                size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] + 1;
}

Without restrict, the compiler must assume dst[i] could alias src[i+1] and reload after every store. With it, it can load a vector of src, add, and store, ignoring overlap. If a ring-buffer compaction path calls this with overlapping regions, behavior diverges from memmove: vectorized or prefetched copies read partially overwritten source. In a NIC path that means stale metadata, duplicated descriptors, or checksum flags copied from the wrong slot.

The senior point is that restrict is a semantic contract, not documentation. Removing it from a conforming program must not change observable behavior; adding it to a call that *does* overlap is UB. If overlap is possible, use memmove, split into overlap/non-overlap variants, or debug-assert the precondition at the call boundary.
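
One way to split into overlap and non-overlap variants at the call boundary is sketched below. The function names are hypothetical, and comparing addresses through `uintptr_t` assumes a flat address space (the conversion is implementation-defined), so treat this as an illustration, not a portable guarantee.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* True when [a, a+n) and [b, b+n) share no element (flat addresses assumed). */
static int words_disjoint(const uint32_t *a, const uint32_t *b, size_t n)
{
    uintptr_t ua = (uintptr_t)a, ub = (uintptr_t)b;
    size_t bytes = n * sizeof *a;
    return ua + bytes <= ub || ub + bytes <= ua;
}

/* Fast path: the restrict contract holds, so the loop may be vectorized. */
static void add1_words_fast(uint32_t *restrict dst,
                            const uint32_t *restrict src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] + 1;
}

/* Call boundary: enter the restrict path only when the promise is true. */
static void add1_words(uint32_t *dst, const uint32_t *src, size_t n)
{
    if (words_disjoint(dst, src, n)) {
        add1_words_fast(dst, src, n);
    } else if ((uintptr_t)dst <= (uintptr_t)src) {
        for (size_t i = 0; i < n; i++)     /* forward: reads stay ahead of writes */
            dst[i] = src[i] + 1;
    } else {
        for (size_t i = n; i-- > 0; )      /* backward: dst lies above src */
            dst[i] = src[i] + 1;
    }
}
```

The overlapping branches pick a direction the way `memmove` does, so every source word is read before it can be overwritten.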

What they're listening for: A strong answer explains both the optimization win and the contractual danger, and notes the asymmetry: removing `restrict` is always safe, adding it can break a previously-correct caller.

Follow-ups

How does `restrict` interact with pointers stored inside a struct, and what is block-scope restrict?
Would you expose `restrict` in a public driver API?
What compiler output would you inspect to prove it helped?

A reviewer sees `idx = idx++ & mask;` in a transmit ring. Explain the sequencing bug and why it may survive testing. (senior)

The expression modifies idx twice, once as the side effect of idx++ and once through the assignment, and nothing sequences those two side effects on the same scalar within the full expression. In C11 terms this is *unsequenced* modification of the same object, which is undefined behavior (the older 'two modifications between sequence points' phrasing means the same thing).

That one compiler emits something that appears to work at -O0 is not evidence of correctness. At -O2 the optimizer assumes UB never happens and may transform surrounding code on that assumption. In a ring this surfaces as skipped descriptors, an infinite poll loop, or a bug that only appears once LTO inlines the helper and re-derives the assumption across the call.

Write the state transition explicitly:

uint32_t old = idx;
idx = (idx + 1) & mask;
return old;

If idx is shared across producer/consumer contexts this is still insufficient; it then needs atomics or a lock with correct wrap semantics.
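
For the shared case, a minimal C11 sketch might claim slots with a single atomic read-modify-write on a free-running counter and mask only when indexing. `RING_SIZE`, `tx_head`, and `tx_claim_slot` are hypothetical names, and the sketch deliberately omits the ring-full check and the release/acquire publication of slot contents.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define RING_SIZE 8u                  /* hypothetical; must be a power of two */
#define RING_MASK (RING_SIZE - 1u)

static _Atomic uint32_t tx_head;      /* free-running; wraps modulo 2^32 */

/* Claim the next slot: one atomic increment, then mask the old value. */
static uint32_t tx_claim_slot(void)
{
    uint32_t old = atomic_fetch_add_explicit(&tx_head, 1u,
                                             memory_order_relaxed);
    return old & RING_MASK;
}
```

Relaxed ordering is enough to hand out unique indices; it says nothing about when the descriptor written into that slot becomes visible to the consumer.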

What they're listening for: This probes whether the candidate knows sequencing is not just a style issue. Good answers describe the optimizer consequence and use modern 'sequenced-before' terminology rather than only 'sequence point'.

Follow-ups

What changed in terminology between older C sequence points and modern sequencing language?
Why does `i = i + 1` not have the same problem?
Would UBSan reliably catch this, or can it miss it after inlining?

Why is signed integer overflow undefined behavior, and what does that mean for ring arithmetic, length checks, and compiler optimizations? (staff)

Signed overflow is undefined so the optimizer can assume it never happens. That enables transformations such as treating x + 1 > x as always true for signed int, simplifying a*2/2 to a, and proving loops with signed induction variables terminate. In packet code this is dangerous when lengths, offsets, or ring deltas use signed types and can cross the representable boundary.

For rings and sizes, prefer unsigned fixed-width or size_t. Unsigned overflow is defined modulo 2^N, but that alone does not make every expression correct: comparisons across wrap need deliberate half-range rules.

Review points:

int len = hdr_len + payload_len; if (len < hdr_len) is not a valid overflow check for signed values; the overflow it tries to detect is already UB, so the compiler may delete the branch.
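uint32_t used = prod - cons; is the standard ring idiom. It is correct as long as both counters are free-running (they increase monotonically modulo 2^32 and are masked only when indexing) and occupancy never exceeds a capacity smaller than the counter range; the half-range limit applies when you order two counters with a signed difference such as (int32_t)(a - b) < 0. The subtraction wraps correctly precisely because it is unsigned.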
Cast before widening: (uint64_t)a * b differs from (uint64_t)(a * b) when a * b overflows int first.

For hardened parsing use __builtin_add_overflow/__builtin_mul_overflow, C23 ckd_add/ckd_mul, or bounds checks written so neither operand can overflow.
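
A small sketch of both ideas, assuming GCC or Clang for `__builtin_add_overflow` (C23 code could use `ckd_add` from `<stdckdint.h>` instead). The function names and the 32-bit length types are illustrative choices, not part of any particular driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stores hdr_len + payload_len in *total and returns true only if the sum
 * fits in uint32_t and does not exceed frame_len. */
static bool total_len_ok(uint32_t hdr_len, uint32_t payload_len,
                         uint32_t frame_len, uint32_t *total)
{
    uint32_t sum;
    if (__builtin_add_overflow(hdr_len, payload_len, &sum))
        return false;                /* wrapped: reject before any use */
    if (sum > frame_len)
        return false;
    *total = sum;
    return true;
}

/* Occupancy of a ring with free-running unsigned counters. */
static uint32_t ring_used(uint32_t prod, uint32_t cons)
{
    return prod - cons;              /* well defined modulo 2^32 */
}
```

The occupancy helper stays correct across counter wrap, for example when `prod` has wrapped to 5 while `cons` is still just below 2^32.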

What they're listening for: The strong signal is connecting UB to concrete optimizer assumptions and to wrap-safe ring-counter design. The trap is claiming unsigned is always safe without addressing comparison invariants.

Follow-ups

What does `-fwrapv` change in GCC/Clang, and why is relying on it risky for portable code?
How would you design a 32-bit producer/consumer counter for a 4096-entry ring?
Where should packet length validation happen relative to pointer arithmetic?

Explain integer promotions and usual arithmetic conversions using a packet parser example where a bounds check is accidentally bypassed. (senior)

Small integer types (uint8_t, uint16_t) are promoted, usually to int, before arithmetic. Then the usual arithmetic conversions pick a common type for binary operators. Bugs appear when signed and unsigned mix, or when subtraction happens before widening.

Example:

uint16_t off = get_off(pkt);
uint16_t len = get_len(pkt);
if (off + len <= frame_len) {
    parse(pkt + off, len);
}

Here off + len is computed in int after promotion, so the sum itself cannot wrap (two 16-bit values fit), and the check is sound as long as frame_len is non-negative. The same pattern with uint32_t operands and an int frame_len is not: the usual arithmetic conversions turn the signed length into unsigned, so a negative frame_len becomes a huge value and the check passes, and off + len can now wrap modulo 2^32 as well. Likewise, with unsigned types frame_len - off wraps to a huge value when off > frame_len.

Safer pattern: normalize once into a type that holds the full range, validate without overflow, avoid mixed-sign comparisons:

size_t off = get_off(pkt);
size_t len = get_len(pkt);
if (off <= frame_len && len <= frame_len - off)
    parse(pkt + off, len);

The key skill is not memorizing conversion ranks; it is recognizing where the *expression* type differs from the type printed in the struct definition.
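
The bypass can be reproduced in a few lines. The sketch below assumes a platform where int is 32 bits wide; both function names are hypothetical, and the first is intentionally buggy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* The expression type is not the operand type: uint8_t operands promote to int. */
_Static_assert(sizeof((uint8_t)1 + (uint8_t)1) == sizeof(int),
               "uint8_t operands promote to int");

/* Buggy on purpose: frame_len is converted to unsigned, so -1 becomes
 * UINT32_MAX, and off + len can itself wrap modulo 2^32. */
static bool naive_in_bounds(uint32_t off, uint32_t len, int frame_len)
{
    return off + len <= frame_len;
}

/* Safer: reject negative lengths first, then compare without overflow. */
static bool checked_in_bounds(uint32_t off, uint32_t len, int frame_len)
{
    if (frame_len < 0)
        return false;
    size_t limit = (size_t)frame_len;
    return off <= limit && len <= limit - off;
}
```

Compiling the naive version with `-Wsign-compare` flags the mixed-sign comparison, which is the cheapest place to catch it.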

What they're listening for: A good answer shows how promotions change the expression type and silently affect security-critical bounds checks, and orders the subtraction so it cannot underflow.

Follow-ups

Why is `sizeof(a + b)` sometimes surprising for `uint8_t` operands?
How do you enforce no mixed-sign comparisons in CI (`-Wsign-compare`, `-Wconversion`)?
What is wrong with checking `off + len < off` after signed arithmetic?

How would you allocate and lay out a cache-line-aligned descriptor ring in portable C, and what are the gotchas around `alignof`, `aligned_alloc`, and `_Alignas`? (senior)

First distinguish language alignment from hardware performance alignment. _Alignof/alignof reports the required alignment for a C *type*, not the cache line, DMA boundary, IOMMU page, or NIC descriptor requirement.

aligned_alloc(alignment, size) (C11) requires size to be a multiple of alignment, and alignment must be a value the implementation supports. POSIX posix_memalign(&p, alignment, size) avoids the size-multiple rule and is often easier. For DMA memory neither is sufficient by itself: use the OS DMA allocation/mapping API so the device-visible address, cacheability, and lifetime are correct.

_Alignas over-aligns a declared object or struct member, and is the clean way to force a hot field onto its own line:

struct ring {
    _Alignas(64) uint32_t prod;   /* starts the producer's cache line */
    _Alignas(64) uint32_t cons;   /* starts the consumer's cache line */
    struct desc *slots;           /* read-mostly; shares the consumer's line */
};

Note _Alignas raises a type's alignment but malloc only guarantees max_align_t; to actually get 64-byte storage from the heap you still need aligned_alloc/posix_memalign, and free releases memory from either. The real latency win is usually separating producer and consumer indices (and doorbell shadow state) so they do not false-share, not merely aligning the base pointer.
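
A host-memory sketch of the heap side, assuming 64-byte cache lines and an embedded slot array; `struct desc`, `CACHE_LINE`, and `ring_alloc` are hypothetical. It covers ordinary memory only and is no substitute for the OS DMA allocation API.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64u   /* assumption: 64-byte lines; not universal */

struct desc { uint64_t addr; uint32_t len; uint32_t flags; };

struct ring {
    alignas(CACHE_LINE) uint32_t prod;          /* producer-written line */
    alignas(CACHE_LINE) uint32_t cons;          /* consumer-written line */
    alignas(CACHE_LINE) struct desc slots[256];
};

/* Round n up to a multiple of align, as C11 aligned_alloc expects. */
static size_t round_up(size_t n, size_t align)
{
    return (n + align - 1) / align * align;
}

/* Allocate a ring whose storage really has the over-aligned type's alignment. */
static struct ring *ring_alloc(void)
{
    struct ring *r = aligned_alloc(alignof(struct ring),
                                   round_up(sizeof *r, alignof(struct ring)));
    if (r)
        r->prod = r->cons = 0;
    return r;                                   /* release with free() */
}
```

Here `sizeof(struct ring)` is already a multiple of its alignment, so the rounding is a no-op; it matters when the size comes from a runtime slot count.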

What they're listening for: This separates C object alignment from cache/DMA constraints, catches the `aligned_alloc` size rule, and knows `malloc` alone will not honor an over-aligned struct. Senior candidates mention false sharing and OS DMA APIs.

Follow-ups

When is 64-byte alignment the wrong assumption (Apple M-series 128B, DRAM page, huge pages)?
Why does over-aligning a struct member not change what plain `malloc` returns?
What should free the memory returned by `aligned_alloc`?

Compare union type punning, pointer casts, and `memcpy` for converting between `float`/`uint32_t` or descriptor views. What would you allow in shared low-level code? (senior)

Pointer-cast punning such as *(uint32_t *)&f is the most suspect: it can violate strict aliasing and may form a misaligned pointer. Union punning (writing one member, reading another) is *defined in C since C99/C11* (the read reinterprets the object representation), but it is technically *undefined* in standard C++ where most compilers allow it only as an extension. So union punning is fine for C-only code but a poor contract for headers shared with C++.

For common infrastructure compiled by multiple compilers and sometimes shared with C++, I prefer memcpy (or C++20 bit_cast on the C++ side). In C:

uint32_t bits;
float f = 1.0f;
_Static_assert(sizeof bits == sizeof f, "float size");
memcpy(&bits, &f, sizeof bits);

Compilers recognize fixed-size memcpy and emit a register move. It also documents intent: copying object representation, not pretending one object has two unrelated effective types.

For NIC descriptors I go further and define byte-order-aware field accessors, keeping endian, alignment, and compiler-behavior concerns localized rather than exposing raw punning to callers.
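
Such accessors might look like the sketch below, which keeps the descriptor as raw bytes and hides offsets and byte order behind getters and setters. The wire layout, offsets, and names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical wire layout: bytes 0-7 address, 8-9 length, 10-11 flags. */
enum { TXD_OFF_ADDR = 0, TXD_OFF_LEN = 8, TXD_OFF_FLAGS = 10, TXD_SIZE = 16 };

struct tx_desc { unsigned char raw[TXD_SIZE]; };   /* opaque to callers */

/* Read the little-endian length field, whatever the host byte order. */
static inline uint16_t txd_get_len(const struct tx_desc *d)
{
    return (uint16_t)(d->raw[TXD_OFF_LEN] | (d->raw[TXD_OFF_LEN + 1] << 8));
}

/* Write the length field in little-endian wire order. */
static inline void txd_set_len(struct tx_desc *d, uint16_t len)
{
    d->raw[TXD_OFF_LEN]     = (unsigned char)(len & 0xff);
    d->raw[TXD_OFF_LEN + 1] = (unsigned char)(len >> 8);
}
```

Callers never see a typed view of the descriptor, so there is nothing to alias, misalign, or byte-swap incorrectly outside these helpers.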

What they're listening for: The candidate must not overstate union portability *or* wrongly claim it is UB in C. The precise line is: defined in C, an extension in C++. A senior answer chooses a boring, optimizable idiom that survives the C/C++ boundary.

Follow-ups

How does C++20 `std::bit_cast` change the answer for C++ code, and what does it require of the types?
What would `-Wstrict-aliasing` catch or miss?
Why can character types inspect object representation when other types cannot?

When would you use a flexible array member in a packet or control-message structure, and what failure modes do you watch for? (senior)

A flexible array member fits variable-length trailing data:

struct msg {
    uint16_t type;
    uint16_t len;
    unsigned char data[];
};

The allocation must cover the header plus payload, and sizeof(struct msg) excludes the flexible array. The classic bugs are allocating only sizeof(struct msg) then writing data[0], and trusting the embedded len without checking the real buffer size.

Allocation arithmetic must not overflow. Prefer offsetof(struct msg, data) + len over sizeof(*m) + len (they differ by trailing padding), and guard the multiply/add:

/* n elements of elem bytes each (elem != 0); reject sizes that would wrap */
if (n > (SIZE_MAX - offsetof(struct msg, data)) / elem)
    return NULL;
struct msg *m = malloc(offsetof(struct msg, data) + n * elem);

I also review wire format: the C layout may include padding before data, and multi-byte fields still need endian handling. For ABI-stable or device-visible formats a flexible array is fine only if the fixed header layout is asserted (_Static_assert(offsetof(struct msg, data) == EXPECTED, "wire header")) and the parser treats input as bytes until len <= buffer_len - offsetof(struct msg, data) is checked.
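
Allocation and validation order can be sketched together. The helper names are hypothetical, and the parser assumes, for this sketch only, that the wire header is two little-endian 16-bit fields with no padding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct msg {
    uint16_t type;
    uint16_t len;              /* payload bytes, host order once parsed */
    unsigned char data[];
};

/* Allocate a message with room for len payload bytes; NULL on failure. */
static struct msg *msg_alloc(uint16_t type, size_t len)
{
    if (len > UINT16_MAX)                  /* must fit the len field */
        return NULL;
    struct msg *m = malloc(offsetof(struct msg, data) + len);
    if (!m)
        return NULL;
    m->type = type;
    m->len = (uint16_t)len;
    return m;
}

/* Validate an untrusted buffer before treating any of it as a message. */
static int msg_payload_len(const unsigned char *buf, size_t buf_len,
                           size_t *out)
{
    const size_t hdr = 4;                  /* assumed wire header size */
    if (buf_len < hdr)
        return -1;
    size_t len = (size_t)buf[2] | ((size_t)buf[3] << 8);
    if (len > buf_len - hdr)               /* subtraction cannot underflow */
        return -1;
    *out = len;
    return 0;
}
```

The parser reads only bytes until the length check has passed, which keeps the untrusted `len` from ever steering pointer arithmetic.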

What they're listening for: This probes `sizeof`/`offsetof` behavior and refusal to confuse a C convenience with a wire-format guarantee. Good answers mention padding, allocation-overflow guarding, and validation order.

Follow-ups

Why must a flexible array member be last and the struct have at least one other member?
How do zero-length arrays (`data[0]`) differ as a GCC extension, including under `-fsanitize=bounds`?
Why prefer `offsetof(..., data)` over `sizeof(*m)` for the allocation size?

How do `_Generic`, designated initializers, and `_Static_assert` help maintain a low-level C codebase without hiding hardware details? (senior)

These features earn their place when they make invariants executable at compile time.

_Static_assert pins ABI and descriptor assumptions: struct size, alignment, offsets, enum values shared with firmware. Designated initializers make sparse hardware tables safer when constants are non-contiguous or fields get added (unmentioned fields zero-initialize). _Generic selects a type-correct helper at compile time without macros that evaluate arguments multiple times.

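/* Call inside each association: le16toh and friends are often function-like
   macros (glibc <endian.h>), so a bare name would not compile there. */
#define le_to_cpu(x) _Generic((x), \
    uint16_t: le16toh(x), \
    uint32_t: le32toh(x), \
    uint64_t: le64toh(x))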

_Static_assert(sizeof(struct tx_desc) == 16, "tx_desc ABI");
struct ops ops = { .poll = rx_poll, .kick = tx_kick };

A key subtlety: _Generic selects on the *type* of the controlling expression but does not evaluate it (like sizeof), and the controlling expression undergoes lvalue conversion, so _Generic(arr, ...) sees T*, not the array type. The gotcha is that clever generic macros obscure codegen and error messages; for a fast path I still inspect generated assembly and keep side effects out of macro arguments.
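
Both properties can be checked with a small probe macro. `TYPE_NAME` and `next_value` are hypothetical helpers, and the array result assumes a compiler that follows the C17 wording on conversions of the controlling expression (current GCC and Clang do).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Maps the (converted) type of an expression to a printable name. */
#define TYPE_NAME(x) _Generic((x), \
    uint16_t:   "uint16_t",        \
    uint32_t:   "uint32_t",        \
    uint32_t *: "uint32_t *",      \
    default:    "other")

static int calls;                          /* counts real evaluations */

static uint32_t next_value(void)
{
    return (uint32_t)++calls;
}
```

`TYPE_NAME(next_value())` selects the `uint32_t` association while leaving `calls` at zero, and `TYPE_NAME(arr)` for a `uint32_t arr[4]` selects the pointer association because the array decays first.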

What they're listening for: A strong answer treats modern C as compile-time guardrails, not abstraction for its own sake, and knows `_Generic` does not evaluate its controlling expression and decays arrays.

Follow-ups

What does `_Generic` evaluate, and how does lvalue/array decay affect the chosen association?
Where would you put offset static assertions so a firmware ABI bump fails the build early?
How do designated initializers reduce risk during firmware ABI changes?

What is the as-if rule, and how does it explain `-O2` or `-O3` breaking naive polling, benchmarking, or MMIO code? (staff)

The as-if rule lets the compiler perform any transformation that preserves the *observable behavior* of a well-defined program. In the C standard, observable behavior is limited to accesses to volatile objects, the data written to files by the time the program terminates, and the input/output dynamics of interactive devices. The problem is that most hardware-facing intentions are invisible to the abstract machine unless expressed through the right mechanism.

Examples:

A spin loop on a non-atomic, non-volatile flag set by another thread can be hoisted: the compiler proves the flag is loop-invariant and turns it into if (!flag) for(;;);.
A microbenchmark whose result is unused is deleted entirely (dead-store / dead-code elimination).
Stores meant to order a descriptor before a doorbell can be reordered or merged unless you use atomics, a compiler barrier, or MMIO accessors with ordering semantics.
Signed-overflow or aliasing UB hands the optimizer permission to assume 'impossible' paths never occur.

For MMIO use the platform accessors (readl/writel) and documented DMA barriers, not raw dereferences. For inter-thread state use C11 atomics or kernel primitives. For benchmarking, consume the result through a compiler-visible sink (asm volatile("" :: "r"(x)) or DoNotOptimize) and serialize appropriately.

The mental model: the compiler optimizes the C program you wrote, not the hardware protocol you had in mind.
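
For the benchmarking case, a compiler-visible sink can be sketched as below. The empty inline asm is GCC/Clang-specific, the `"r"` constraint assumes a 64-bit target, and `checksum_words` is a hypothetical stand-in for the code under test; timing and serialization are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sink: the empty asm claims to read v and to touch memory, so the
 * computation feeding v cannot be deleted or hoisted out of the loop. */
static inline void keep_alive_u64(uint64_t v)
{
    __asm__ volatile("" : : "r"(v) : "memory");
}

/* Hypothetical workload: sum of 32-bit words. */
static uint64_t checksum_words(const uint32_t *p, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += p[i];
    return sum;
}

/* Benchmark body: every iteration's result is consumed by the sink. */
static void bench_checksum(const uint32_t *p, size_t n, unsigned iters)
{
    for (unsigned i = 0; i < iters; i++)
        keep_alive_u64(checksum_words(p, n));
}
```

Inspecting the generated assembly is still the only proof that the loop body survived at the optimization level you measure.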

What they're listening for: Staff-level because it ties the abstract machine to device protocols and measurement methodology. The trap is treating `volatile` as a universal ordering, atomicity, or DMA-visibility solution.

Follow-ups

What exactly counts as 'observable behavior' in the standard?
Why can an empty polling loop disappear, and what minimal change prevents it?
How would you design a benchmark so the compiler cannot delete the work?

Explain pointer provenance and object lifetime. Why can `realloc` invalidate aliases, and why is comparing or dereferencing a pointer to a freed object UB even if the bit pattern looks valid? (staff)

A pointer value in C is not just an address; it carries provenance - an association with a particular object whose storage duration bounds the pointer's validity. The optimizer reasons about provenance to prove two pointers cannot alias even when their numeric addresses could coincide.

When realloc moves a block, the old pointer's object reaches end of lifetime; every existing pointer into it (including interior pointers and restrict-derived ones) becomes indeterminate. Using such a pointer - even to *compare* it, not only to dereference - is undefined. This bites NIC code that caches interior pointers into a buffer that later grows.

struct buf { uint8_t *base; size_t len, cap; };
uint8_t *hdr = b->base + off;     /* interior pointer */
uint8_t *nb = realloc(b->base, b->cap * 2);
if (!nb) return -1;               /* failure: old block and hdr stay valid */
b->base = nb;
b->cap *= 2;
/* hdr is now indeterminate even if realloc returned the same address */
hdr = b->base + off;              /* must re-derive from the new base */

Two consequences engineers miss:

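A pointer one-past-the-end of one object can compare bit-equal to a pointer into an adjacent object, yet the standard treats them as distinct provenances; equality comparison between them is permitted but compilers may answer it from provenance rather than from the address bits, and relational comparison (<, >) across unrelated objects is undefined.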
Round-tripping a pointer through uintptr_t and back does not necessarily restore provenance under aggressive analysis (this is the subject of WG14's provenance work, N2263/N3005).

The practical rule: after any reallocation or free, treat all derived pointers as dead and re-derive them from the live base; never compare or hash raw pointer values across object boundaries.
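
One way to follow that rule is to store offsets instead of interior pointers and re-derive on each use. The sketch below reuses the `struct buf` shape from above; `buf_reserve` and `buf_at` are hypothetical helpers.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct buf { uint8_t *base; size_t len, cap; };

/* Grow capacity to at least want bytes. On failure the old block stays valid. */
static int buf_reserve(struct buf *b, size_t want)
{
    if (want <= b->cap)
        return 0;
    size_t cap = b->cap ? b->cap : 64;
    while (cap < want) {
        if (cap > SIZE_MAX / 2)
            return -1;                    /* doubling would overflow */
        cap *= 2;
    }
    uint8_t *nb = realloc(b->base, cap);
    if (!nb)
        return -1;
    b->base = nb;                         /* every old interior pointer is dead */
    b->cap = cap;
    return 0;
}

/* Callers keep offsets, which survive reallocation, and re-derive pointers. */
static uint8_t *buf_at(const struct buf *b, size_t off)
{
    return off < b->len ? b->base + off : NULL;
}
```

An offset is just an integer, so it carries no provenance to invalidate; the pointer is rebuilt from the live base each time it is needed.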

What they're listening for: Distinguishes 'address' from 'pointer value with provenance and lifetime'. Strong answers know that even *comparing* a dangling pointer is UB and that `realloc` can return the same address while still ending the old object's lifetime.

Follow-ups

Why does `free(p); if (p == q) ...` risk UB even without dereferencing?
How does provenance let the compiler keep a value in a register across an opaque call?
What is the practical impact of the WG14 provenance proposal on driver code?

Give the precise C semantics of `volatile`. What does a `volatile` access guarantee and not guarantee, and when is it actually the right tool? (staff)

volatile means every access to the object is an observable side effect: the compiler must emit exactly the loads and stores the abstract machine specifies, in program order *with respect to other volatile and I/O side effects*, and may not elide, duplicate, fuse, or reorder them past each other. That is its entire job.

What it does not provide:

Atomicity. A volatile uint64_t on a 32-bit target still tears.
Inter-thread ordering or visibility. It is not a memory barrier; non-volatile accesses around it, and the CPU's own reordering, are unconstrained. Two threads racing on a volatile is still a data race and UB.
Cache coherency. It does nothing about DMA visibility.

The three portable, correct uses are exactly:

MMIO: the address has side effects the compiler cannot see (volatile uint32_t *reg), usually wrapped in accessors that add the needed CPU/device barriers.
`volatile sig_atomic_t` for a flag shared with a signal handler in the *same* thread.
Locals modified between `setjmp` and `longjmp` - without volatile, their value after longjmp is indeterminate.

volatile uint32_t *status = ioremap_reg(BAR0 + STATUS);
while (!(*status & DONE))   /* re-read each iteration, not hoisted */
    cpu_relax();

Note the loop is correct against the *device* updating the register, but the volatile read of the status register orders no *other* memory access; you still need a barrier before consuming data the DONE bit gates.
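
The signal-handler use can be shown with standard C alone. In this sketch `raise` stands in for an asynchronous signal so the example is deterministic; the handler and loop names are hypothetical.

```c
#include <assert.h>
#include <signal.h>

static volatile sig_atomic_t stop_requested;   /* written only by the handler */

static void on_signal(int signo)
{
    (void)signo;
    stop_requested = 1;      /* storing to a volatile sig_atomic_t is permitted */
}

/* Poll loop: volatile forces a fresh read of the flag on every iteration. */
static unsigned run_until_stopped(unsigned max_iters)
{
    unsigned i = 0;
    while (!stop_requested && i < max_iters) {
        if (i == 3)
            raise(SIGINT);   /* stand-in for an asynchronous signal */
        i++;
    }
    return i;
}
```

This works because the handler runs in the same thread as the loop; the identical flag shared with a second thread would be a data race.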

What they're listening for: The crisp answer is 'volatile constrains the compiler's treatment of those specific accesses, nothing about the hardware or other memory.' Naming the exact three portable uses (MMIO, sig_atomic_t, setjmp locals) is the staff signal.

Follow-ups

Why is `volatile` insufficient for a flag shared between two threads, but acceptable for a signal handler?
What ordering does a `volatile` load give relative to a neighboring non-volatile store?
Why do kernels often wrap MMIO in `readl`/`writel` rather than bare `volatile` dereferences?

Distinguish a compiler barrier from a CPU memory barrier in C. Show the minimal primitives and where each is necessary in a driver. (staff)

They constrain different reordering layers:

A compiler barrier stops the compiler from moving memory accesses across a point, but emits no instruction; the CPU may still reorder. In GCC/Clang: asm volatile("" ::: "memory"), or portably atomic_signal_fence(memory_order_seq_cst).
A CPU memory barrier emits a fence (e.g. mfence, dmb ish) that constrains the *hardware's* ordering and implies a compiler barrier too. Portably: atomic_thread_fence(memory_order_*).

Which you need depends on who the other observer is:

Two threads on cache-coherent CPUs: a CPU barrier (or, better, acquire/release atomics) is required; a compiler barrier alone is not enough on weakly-ordered hardware like Arm.
A thread versus a signal handler on the *same* core: only a compiler barrier is needed - the core sees its own stores in order - which is exactly what atomic_signal_fence is for.
A thread versus a DMA device: you need the platform DMA barrier, not a plain C fence. atomic_thread_fence orders accesses for the C memory model and other CPUs, but the DMA path may require dma_wmb() semantics that also account for write-combining and device-visible ordering.

write_descriptor(&ring[i], desc);   /* fill descriptor   */
dma_wmb();                          /* device sees descriptor before owner bit */
ring[i].flags = cpu_to_le32(OWN_DEVICE);
wmb();                              /* order vs MMIO doorbell */
writel(i, doorbell);                /* kick */

The staff point: x86's strong (TSO) model hides most of this in testing - it only reorders store-then-load - so a missing barrier often passes on x86 and fails on Arm. Audit by reasoning about the model, not by what reproduces.
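
For the thread-to-thread case only, the portable C11 form of "publish data, then set a flag" is a release store paired with an acquire load. The one-shot `struct mailbox` below is a hypothetical sketch and does not cover the DMA case, which still needs the platform barriers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct mailbox {
    uint32_t payload;        /* plain data, published by the ready flag */
    atomic_bool ready;
};

/* Producer: write the data, then publish it with release ordering. */
static void mailbox_post(struct mailbox *m, uint32_t value)
{
    m->payload = value;
    atomic_store_explicit(&m->ready, true, memory_order_release);
}

/* Consumer: an acquire load that observes ready == true also observes
 * the payload written before the release store. */
static bool mailbox_try_take(struct mailbox *m, uint32_t *out)
{
    if (!atomic_load_explicit(&m->ready, memory_order_acquire))
        return false;
    *out = m->payload;
    return true;
}
```

On x86 both operations compile to plain moves, which is why a version written with ordinary variables tends to pass there and fail on Arm.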

What they're listening for: Strong answers map the three observer cases (other thread / same-core signal handler / DMA device) to compiler barrier vs CPU fence vs DMA barrier, and note that x86 TSO masks the bug. The trap is thinking `asm volatile("" ::: "memory")` orders anything for the hardware.

Follow-ups

Why is `atomic_signal_fence` enough for a signal handler but not for a second thread?
On x86, which single reordering does the hardware still allow, and which barrier kills it?
Why might `dma_wmb()` differ from a generic `atomic_thread_fence(release)` on a given arch?
