🧬 Advanced C and Undefined Behavior
Probes whether a senior low-level engineer can reason about the C abstract machine, optimizer assumptions, ABI details, and real driver failure modes.
A NIC fast path casts a DMA completion byte buffer to `struct rx_cqe *` and reads fields directly. What strict-aliasing, effective-type, alignment, and endian issues would you review before accepting it? (staff)
I would separate object representation from C object type. Bytes from a device or allocator can always be read through `unsigned char *` (character types may alias anything), but reading them through `struct rx_cqe *` is only well defined if that storage actually has `rx_cqe` as its effective type, or the implementation contract explicitly supports the mapping. Malloc'd or DMA storage has no declared type; its effective type is set by the last store made through a non-character lvalue (or by a `memcpy` from a typed object). Reading through `struct rx_cqe *` a region that is really a declared byte array, or that was last written as some other type, is the part the optimizer can punish.
The checklist is:
- Alignment: `struct rx_cqe *` may require 4-, 8-, or larger alignment. Forming a misaligned pointer is itself UB before any access; some CPUs (older Arm, some accelerators) trap, others split the load and hurt latency.
- Packing: `__attribute__((packed))` removes padding assumptions but tells the compiler the struct is under-aligned, so it emits byte-wise or unaligned loads. It is a correctness tool, not a performance answer.
- Effective type and aliasing: under `-O2 -fstrict-aliasing` the compiler may assume a `struct rx_cqe *` does not alias unrelated typed objects, and may reorder or fold loads accordingly.
- Endianness: descriptor fields are in device byte order; decode with `le16toh`/`le32toh`, never assume native.
- Volatility/coherency: DMA memory is not MMIO. `volatile` is not a cache-coherency primitive; use the platform DMA API (`dma_rmb()` on read, ownership flags) so you observe the descriptor only after the device released it.
The robust pattern is to copy bytes into a properly aligned object:
struct rx_cqe cqe;
memcpy(&cqe, bytes, sizeof cqe);
len = le16toh(cqe.len);
For small fixed-size copies, optimizing compilers turn memcpy into scalar loads while preserving the language-level aliasing story.
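The copy-then-decode idea can be pushed one step further by decoding straight from bytes. The sketch below assumes a hypothetical 8-byte little-endian completion layout (`len`, `flags`, `hash`) and a hypothetical host-order struct `rx_cqe_host`; a real `struct rx_cqe` is device specific. Reading only through `unsigned char` keeps alignment, aliasing, and host endianness out of the picture.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wire layout, little-endian:
   len at byte 0 (u16), flags at byte 2 (u16), hash at byte 4 (u32). */
#define RX_CQE_WIRE_SIZE 8u

struct rx_cqe_host {            /* decoded, host-order view */
    uint16_t len;
    uint16_t flags;
    uint32_t hash;
};

static uint16_t load_le16(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t load_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Works for any alignment of bytes and on any host byte order. */
static struct rx_cqe_host rx_cqe_decode(const unsigned char *bytes, size_t avail)
{
    assert(bytes != NULL && avail >= RX_CQE_WIRE_SIZE);
    struct rx_cqe_host c = {
        .len   = load_le16(bytes + 0),
        .flags = load_le16(bytes + 2),
        .hash  = load_le32(bytes + 4),
    };
    return c;
}
```

Whether the compiler fuses the byte loads into one wide load is worth confirming in the generated assembly for the targets that matter.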
- When would you deliberately compile a driver component with `-fno-strict-aliasing`?
- How would you detect an alignment-sensitive bug that reproduces only on Arm?
- Why is `memcpy` often optimized away but still changes the definedness of the access?
What does `restrict` let the optimizer assume, and how can a wrong `restrict` annotation corrupt a packet-processing loop? (senior)
restrict is a promise for the lifetime of that pointer's association: any object accessed through that restricted pointer is accessed *only* through it or values derived from it, not through another independent pointer. It lets the compiler keep values in registers across stores, reorder loads/stores, vectorize, and skip reloads after a write through a different pointer.
This is only correct if dst and src do not overlap:
void copy_words(uint32_t * restrict dst,
                const uint32_t * restrict src,
                size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] + 1;
}
Without restrict, the compiler must assume dst[i] could alias src[i+1] and reload after every store. With it, it can load a vector of src, add, and store, ignoring overlap. If a ring-buffer compaction path calls this with overlapping regions, behavior diverges from memmove: vectorized or prefetched copies read partially overwritten source. In a NIC path that means stale metadata, duplicated descriptors, or checksum flags copied from the wrong slot.
The senior point is that restrict is a semantic contract, not documentation. Removing it from a conforming program must not change observable behavior; adding it to a call that *does* overlap is UB. If overlap is possible, use memmove, split into overlap/non-overlap variants, or debug-assert the precondition at the call boundary.
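One way to split overlap and non-overlap variants is sketched below. The names `copy_inc_fast`, `copy_inc`, and `ranges_overlap` are hypothetical, and the overlap test assumes a flat address space in which converting pointers to `uintptr_t` preserves their order, which holds on mainstream targets but is not guaranteed by the standard.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Overlap test on integer addresses (assumes a flat address space). */
static int ranges_overlap(const void *a, const void *b, size_t bytes)
{
    uintptr_t pa = (uintptr_t)a, pb = (uintptr_t)b;
    return pa < pb ? pb - pa < bytes : pa - pb < bytes;
}

/* Fast path: restrict promises dst and src never overlap.
   The assert checks that promise in debug builds. */
static void copy_inc_fast(uint32_t *restrict dst,
                          const uint32_t *restrict src, size_t n)
{
    assert(n == 0 || !ranges_overlap(dst, src, n * sizeof *dst));
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] + 1;
}

/* Dispatcher without restrict: on overlap it picks a direction that reads
   every source word before that word is overwritten, as memmove does. */
static void copy_inc(uint32_t *dst, const uint32_t *src, size_t n)
{
    if (n == 0)
        return;
    if (!ranges_overlap(dst, src, n * sizeof *dst)) {
        copy_inc_fast(dst, src, n);
    } else if ((uintptr_t)dst <= (uintptr_t)src) {
        for (size_t i = 0; i < n; i++)   /* dst below src: ascending is safe */
            dst[i] = src[i] + 1;
    } else {
        for (size_t i = n; i-- > 0; )    /* dst above src: descend */
            dst[i] = src[i] + 1;
    }
}
```

The branch costs a few cycles per call, so a hot path that can prove non-overlap structurally may still prefer to call the fast variant directly.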
- How does `restrict` interact with pointers stored inside a struct, and what is block-scope restrict?
- Would you expose `restrict` in a public driver API?
- What compiler output would you inspect to prove it helped?
A reviewer sees `idx = idx++ & mask;` in a transmit ring. Explain the sequencing bug and why it may survive testing. (senior)
The expression both reads-and-modifies idx via idx++ and assigns to idx, with no sequencing relationship between the two side effects on the same scalar in one full expression. In C11 terms this is *unsequenced* modification of the same object, which is undefined behavior (the older 'two modifications between sequence points' phrasing means the same thing).
That one compiler emits something that appears to work at -O0 is not evidence of correctness. At -O2 the optimizer assumes UB never happens and may transform surrounding code on that assumption. In a ring this surfaces as skipped descriptors, an infinite poll loop, or a bug that only appears once LTO inlines the helper and re-derives the assumption across the call.
Write the state transition explicitly:
uint32_t old = idx;
idx = (idx + 1) & mask;
return old;
If idx is shared across producer/consumer contexts this is still insufficient; it then needs atomics or a lock with correct wrap semantics.
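For the shared case, a minimal sketch using C11 atomics follows. The `struct tx_ring` layout and `tx_claim_slot` are hypothetical, the counter is free-running, and the sketch deliberately omits the fullness check a real ring needs before handing out a slot.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical multi-producer slot claim on a free-running counter. */
struct tx_ring {
    _Atomic uint32_t next;   /* free-running, wraps modulo 2^32 */
    uint32_t mask;           /* capacity - 1, capacity a power of two */
};

static uint32_t tx_claim_slot(struct tx_ring *r)
{
    assert((r->mask & (r->mask + 1)) == 0);   /* mask must be 2^k - 1 */
    /* One atomic read-modify-write: returns the old value and advances
       the counter, so two producers can never claim the same slot. */
    uint32_t old = atomic_fetch_add_explicit(&r->next, 1,
                                             memory_order_relaxed);
    return old & r->mask;
}
```

Relaxed ordering is enough to hand out distinct slots; publishing the slot's contents to a consumer would still need a release store or a barrier.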
- What changed in terminology between older C sequence points and modern sequencing language?
- Why does `i = i + 1` not have the same problem?
- Would UBSan reliably catch this, or can it miss it after inlining?
Why is signed integer overflow undefined behavior, and what does that mean for ring arithmetic, length checks, and compiler optimizations? (staff)
Signed overflow is undefined so the optimizer can assume it never happens. That enables transformations such as treating x + 1 > x as always true for signed int, simplifying a*2/2 to a, and proving loops with signed induction variables terminate. In packet code this is dangerous when lengths, offsets, or ring deltas use signed types and can cross the representable boundary.
For rings and sizes, prefer unsigned fixed-width or size_t. Unsigned overflow is defined modulo 2^N, but that alone does not make every expression correct: comparisons across wrap need deliberate half-range rules.
Review points:
- `int len = hdr_len + payload_len; if (len < hdr_len)` is not a valid overflow check for signed values; the overflow it tries to detect is already UB, so the compiler may delete the branch.
- `uint32_t used = prod - cons;` is the standard ring idiom when both counters are free-running, increase monotonically modulo 2^32, and never differ by more than the ring capacity; the subtraction wraps correctly precisely because it is unsigned. The half-range limit applies to signed before/after comparisons of the counters, not to this difference.
- Cast before widening: `(uint64_t)a * b` differs from `(uint64_t)(a * b)` when `a * b` overflows `int` first.
For hardened parsing use __builtin_add_overflow/__builtin_mul_overflow, C23 ckd_add/ckd_mul, or bounds checks written so neither operand can overflow.
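The review points can be made concrete with a few small helpers. This is a sketch with hypothetical names (`frame_len_checked`, `ring_used`, `ring_has_space`, `bytes_for`); `__builtin_add_overflow` is a GCC/Clang extension, and the ring helpers assume free-running 32-bit counters.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Total frame length, or false if the sum does not fit in uint16_t.
   The builtin checks the mathematically exact sum against *out's type. */
static bool frame_len_checked(uint16_t hdr_len, uint16_t payload_len,
                              uint16_t *out)
{
    return !__builtin_add_overflow(hdr_len, payload_len, out);
}

/* Entries in flight; correct across counter wrap because it is unsigned. */
static uint32_t ring_used(uint32_t prod, uint32_t cons)
{
    return prod - cons;
}

static bool ring_has_space(uint32_t prod, uint32_t cons, uint32_t capacity)
{
    uint32_t used = ring_used(prod, cons);
    assert(used <= capacity);   /* otherwise the counters are corrupt */
    return used < capacity;
}

/* Widen before multiplying so the product is computed in 64 bits. */
static uint64_t bytes_for(uint32_t count, uint32_t elem_size)
{
    return (uint64_t)count * elem_size;
}
```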
- What does `-fwrapv` change in GCC/Clang, and why is relying on it risky for portable code?
- How would you design a 32-bit producer/consumer counter for a 4096-entry ring?
- Where should packet length validation happen relative to pointer arithmetic?
Explain integer promotions and usual arithmetic conversions using a packet parser example where a bounds check is accidentally bypassed. (senior)
Small integer types (uint8_t, uint16_t) are promoted, usually to int, before arithmetic. Then the usual arithmetic conversions pick a common type for binary operators. Bugs appear when signed and unsigned mix, or when subtraction happens before widening.
Example:
uint16_t off = get_off(pkt);
uint16_t len = get_len(pkt);
if (off + len <= frame_len) {
    parse(pkt + off, len);
}
off + len is computed as int (16-bit values fit), usually fine here. But a similar pattern with uint32_t operands and int frame_len converts the signed length to unsigned: a negative frame_len becomes a huge unsigned value and the check passes. Likewise frame_len - off underflows to a huge value if off > frame_len and the type is unsigned.
Safer pattern: normalize once into a type that holds the full range, validate without overflow, avoid mixed-sign comparisons:
size_t off = get_off(pkt);
size_t len = get_len(pkt);
if (off <= frame_len && len <= frame_len - off)
    parse(pkt + off, len);
The key skill is not memorizing conversion ranks; it is recognizing where the *expression* type differs from the type printed in the struct definition.
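The mixed-sign bypass is easy to reproduce in isolation. In this sketch `bounds_ok_naive` is deliberately buggy and `bounds_ok` is the normalized form; both names are hypothetical, and the bypass as written assumes a 32-bit `int`, as on mainstream ABIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

static_assert(sizeof(uint8_t) < sizeof(int), "uint8_t promotes to int");

/* Buggy on purpose: frame_len is converted to unsigned for the comparison,
   so a negative value becomes huge and the check passes. */
static bool bounds_ok_naive(uint32_t off, uint32_t len, int frame_len)
{
    return off + len <= frame_len;
}

/* Everything normalized to size_t; neither comparison can wrap. */
static bool bounds_ok(size_t off, size_t len, size_t frame_len)
{
    return off <= frame_len && len <= frame_len - off;
}

/* Expression type versus declared type: the operands are uint8_t,
   but the sum has type int. */
static size_t sum_expr_size(void)
{
    uint8_t a = 200, b = 100;
    return sizeof(a + b);
}
```

Building the naive version with `-Wsign-compare` flags the comparison, which is the cheapest way to catch this class of bug in CI.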
- Why is `sizeof(a + b)` sometimes surprising for `uint8_t` operands?
- How do you enforce no mixed-sign comparisons in CI (`-Wsign-compare`, `-Wconversion`)?
- What is wrong with checking `off + len < off` after signed arithmetic?
How would you allocate and lay out a cache-line-aligned descriptor ring in portable C, and what are the gotchas around `alignof`, `aligned_alloc`, and `_Alignas`? (senior)
First distinguish language alignment from hardware performance alignment. _Alignof/alignof reports the required alignment for a C *type*, not the cache line, DMA boundary, IOMMU page, or NIC descriptor requirement.
`aligned_alloc(alignment, size)` as specified in C11 requires `size` to be a multiple of `alignment` (C17 dropped that requirement, but rounding the size up remains the portable choice), and `alignment` must be a value the implementation supports. POSIX `posix_memalign(&p, alignment, size)` has no size-multiple rule, though `alignment` must be a power of two and a multiple of `sizeof(void *)`, and is often easier. For DMA memory neither is sufficient by itself: use the OS DMA allocation/mapping API so the device-visible address, cacheability, and lifetime are correct.
_Alignas over-aligns a declared object or struct member, and is the clean way to force a hot field onto its own line:
struct ring {
    _Alignas(64) uint32_t prod; /* own cache line */
    _Alignas(64) uint32_t cons; /* own cache line */
    struct desc *slots;
};
Note that `_Alignas` on a member raises the alignment requirement of the enclosing struct, but `malloc` only guarantees alignment suitable for `max_align_t`; to actually get 64-byte storage from the heap you still need `aligned_alloc`/`posix_memalign`, and `free` releases memory from either. The real latency win is usually separating producer and consumer indices (and doorbell shadow state) so they do not false-share, not merely aligning the base pointer.
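A minimal heap-allocation sketch for ordinary (non-DMA) memory follows. `CACHE_LINE`, `struct desc`, and `ring_alloc` are hypothetical, and the 64-byte line size is an assumption that does not hold on every CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64u   /* assumption: 64-byte cache lines */

struct desc { uint64_t addr; uint32_t len; uint32_t flags; };

static_assert(sizeof(struct desc) == 16, "descriptor layout");

/* Allocate n descriptors on a cache-line boundary. The byte count is
   rounded up to a multiple of the alignment, as C11 aligned_alloc requires. */
static struct desc *ring_alloc(size_t n)
{
    if (n == 0 || n > (SIZE_MAX - (CACHE_LINE - 1)) / sizeof(struct desc))
        return NULL;                      /* size arithmetic would overflow */
    size_t bytes = n * sizeof(struct desc);
    bytes = (bytes + CACHE_LINE - 1) / CACHE_LINE * CACHE_LINE;
    struct desc *p = aligned_alloc(CACHE_LINE, bytes);
    assert(p == NULL || (uintptr_t)p % CACHE_LINE == 0);
    return p;                             /* release with free() */
}
```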
- When is 64-byte alignment the wrong assumption (Apple M-series 128B, DRAM page, huge pages)?
- Why does over-aligning a struct member not change what plain `malloc` returns?
- What should free the memory returned by `aligned_alloc`?
Compare union type punning, pointer casts, and `memcpy` for converting between `float`/`uint32_t` or descriptor views. What would you allow in shared low-level code? (senior)
Pointer-cast punning such as *(uint32_t *)&f is the most suspect: it can violate strict aliasing and may form a misaligned pointer. Union punning (writing one member, reading another) is *defined in C since C99/C11* (the read reinterprets the object representation), but it is technically *undefined* in standard C++ where most compilers allow it only as an extension. So union punning is fine for C-only code but a poor contract for headers shared with C++.
For common infrastructure compiled by multiple compilers and sometimes shared with C++, I prefer memcpy (or C++20 bit_cast on the C++ side). In C:
uint32_t bits;
float f = 1.0f;
_Static_assert(sizeof bits == sizeof f, "float size");
memcpy(&bits, &f, sizeof bits);
Compilers recognize fixed-size memcpy and emit a register move. It also documents intent: copying object representation, not pretending one object has two unrelated effective types.
For NIC descriptors I go further and define byte-order-aware field accessors, keeping endian, alignment, and compiler-behavior concerns localized rather than exposing raw punning to callers.
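Both ideas fit in a few lines. The sketch below wraps the `memcpy` pun in hypothetical helpers `float_to_bits`/`bits_to_float` and adds a hypothetical byte-order-aware accessor `desc_get_le32`; the expected bit patterns assume IEEE 754 binary32 `float`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");

/* Copies the object representation; no aliasing or alignment assumptions. */
static uint32_t float_to_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

static float bits_to_float(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Little-endian u32 field at byte offset off inside a raw descriptor.
   Callers never see a typed pointer into device memory. */
static uint32_t desc_get_le32(const unsigned char *desc, size_t off)
{
    assert(desc != NULL);
    const unsigned char *p = desc + off;
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}
```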
- How does C++20 `std::bit_cast` change the answer for C++ code, and what does it require of the types?
- What would `-Wstrict-aliasing` catch or miss?
- Why can character types inspect object representation when other types cannot?
When would you use a flexible array member in a packet or control-message structure, and what failure modes do you watch for? (senior)
A flexible array member fits variable-length trailing data:
struct msg {
    uint16_t type;
    uint16_t len;
    unsigned char data[];
};
The allocation must cover the header plus payload, and sizeof(struct msg) excludes the flexible array. The classic bugs are allocating only sizeof(struct msg) then writing data[0], and trusting the embedded len without checking the real buffer size.
Allocation arithmetic must not overflow. Prefer offsetof(struct msg, data) + len over sizeof(*m) + len (they differ by trailing padding), and guard the multiply/add:
/* n elements of elem bytes each; a zero elem would divide by zero */
if (elem == 0 || n > (SIZE_MAX - offsetof(struct msg, data)) / elem)
    return NULL; /* would overflow */
struct msg *m = malloc(offsetof(struct msg, data) + n * elem);
I also review wire format: the C layout may include padding before data, and multi-byte fields still need endian handling. For ABI-stable or device-visible formats a flexible array is fine only if the fixed header layout is asserted (_Static_assert(offsetof(struct msg, data) == EXPECTED, "wire header")) and the parser treats input as bytes until len <= buffer_len - offsetof(struct msg, data) is checked.
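Putting allocation and validation together, a minimal sketch might look like this. `msg_alloc` and `msg_validate` are hypothetical helpers, and the header fields are treated as host-order for brevity; a wire format would decode them with explicit byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct msg {
    uint16_t type;
    uint16_t len;            /* payload bytes that follow */
    unsigned char data[];
};

static_assert(offsetof(struct msg, data) == 4, "wire header");

static struct msg *msg_alloc(uint16_t type, const void *payload, size_t len)
{
    if (len > UINT16_MAX)                 /* must fit the header field */
        return NULL;
    struct msg *m = malloc(offsetof(struct msg, data) + len);
    if (!m)
        return NULL;
    m->type = type;
    m->len = (uint16_t)len;
    if (len)
        memcpy(m->data, payload, len);
    return m;
}

/* Returns the payload length if buf holds a whole message, else -1.
   The input stays plain bytes until the embedded length is checked. */
static long msg_validate(const unsigned char *buf, size_t buf_len)
{
    const size_t hdr = offsetof(struct msg, data);
    uint16_t len;
    if (buf_len < hdr)
        return -1;
    memcpy(&len, buf + offsetof(struct msg, len), sizeof len);
    if (len > buf_len - hdr)
        return -1;
    return len;
}
```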
- Why must a flexible array member be last and the struct have at least one other member?
- How do zero-length arrays (`data[0]`) differ as a GCC extension, including under `-fsanitize=bounds`?
- Why prefer `offsetof(..., data)` over `sizeof(*m)` for the allocation size?
How do `_Generic`, designated initializers, and `_Static_assert` help maintain a low-level C codebase without hiding hardware details? (senior)
These features earn their place when they make invariants executable at compile time.
_Static_assert pins ABI and descriptor assumptions: struct size, alignment, offsets, enum values shared with firmware. Designated initializers make sparse hardware tables safer when constants are non-contiguous or fields get added (unmentioned fields zero-initialize). _Generic selects a type-correct helper at compile time without macros that evaluate arguments multiple times.
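/* le16toh and friends can be function-like macros (as in glibc), so each
   association is written as a call; only the selected one is evaluated. */
#define le_to_cpu(x) _Generic((x), \
    uint16_t: le16toh(x), \
    uint32_t: le32toh(x), \
    uint64_t: le64toh(x))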
_Static_assert(sizeof(struct tx_desc) == 16, "tx_desc ABI");
struct ops ops = { .poll = rx_poll, .kick = tx_kick };
A key subtlety: _Generic selects on the *type* of the controlling expression but does not evaluate it (like sizeof), and the controlling expression undergoes lvalue conversion, so _Generic(arr, ...) sees T*, not the array type. The gotcha is that clever generic macros obscure codegen and error messages; for a fast path I still inspect generated assembly and keep side effects out of macro arguments.
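The selection rules can be checked directly. In this sketch `swap16`, `swap32`, `swap_any`, and the 16-byte `struct tx_desc` layout are hypothetical; the helpers are real functions, so naming them without a call inside `_Generic` is fine here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint16_t swap16(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}

static uint32_t swap32(uint32_t v)
{
    return (v << 24) | ((v & 0xFF00u) << 8) | ((v >> 8) & 0xFF00u) | (v >> 24);
}

/* Picks the swap routine from the argument's type at compile time. */
#define swap_any(x) _Generic((x), uint16_t: swap16, uint32_t: swap32)(x)

/* Hypothetical descriptor shared with firmware; layout pinned at build time. */
struct tx_desc { uint64_t addr; uint32_t len; uint16_t flags; uint16_t tag; };
static_assert(sizeof(struct tx_desc) == 16, "tx_desc ABI");
static_assert(offsetof(struct tx_desc, flags) == 12, "flags offset");

/* The controlling expression is inspected for its type, never evaluated. */
static int generic_is_unevaluated(void)
{
    int i = 0;
    int picked = _Generic(i++, int: 1, default: 0);
    return picked == 1 && i == 0;
}

/* An array operand decays, so it matches the pointer association. */
static int array_decays_in_generic(void)
{
    uint8_t arr[4] = {0};
    (void)arr;
    return _Generic(arr, uint8_t *: 1, default: 0);
}
```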
- What does `_Generic` evaluate, and how does lvalue/array decay affect the chosen association?
- Where would you put offset static assertions so a firmware ABI bump fails the build early?
- How do designated initializers reduce risk during firmware ABI changes?
What is the as-if rule, and how does it explain `-O2` or `-O3` breaking naive polling, benchmarking, or MMIO code? (staff)
The as-if rule lets the compiler perform any transformation that preserves the *observable behavior* of a well-defined program. Observable behavior is, strictly, accesses to volatile objects, the data written to files by program termination, and the input/output dynamics of interactive devices. The problem is that most hardware-facing intentions are invisible to the abstract machine unless expressed through the right mechanism.
Examples:
- A spin loop on a non-atomic, non-volatile flag set by another thread can be hoisted: the compiler proves the flag is loop-invariant and turns it into `if (!flag) for(;;);`.
- A microbenchmark whose result is unused is deleted entirely (dead-store / dead-code elimination).
- Stores meant to order a descriptor before a doorbell can be reordered or merged unless you use atomics, a compiler barrier, or MMIO accessors with ordering semantics.
- Signed-overflow or aliasing UB hands the optimizer permission to assume 'impossible' paths never occur.
For MMIO use the platform accessors (readl/writel) and documented DMA barriers, not raw dereferences. For inter-thread state use C11 atomics or kernel primitives. For benchmarking, consume the result through a compiler-visible sink (asm volatile("" :: "r"(x)) or DoNotOptimize) and serialize appropriately.
The mental model: the compiler optimizes the C program you wrote, not the hardware protocol you had in mind.
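Two of these fixes fit in a short sketch: a bounded poll on an atomic flag, and a benchmark sink. `poll_ready`, `keep`, and `checksum_work` are hypothetical names, and the empty inline asm is a GCC/Clang idiom rather than standard C.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bounded poll on a flag another thread may set. The atomic load is
   re-issued every iteration; a plain bool could legally be read once. */
static bool poll_ready(atomic_bool *ready, unsigned max_spins)
{
    assert(ready != NULL);
    for (unsigned i = 0; i < max_spins; i++)
        if (atomic_load_explicit(ready, memory_order_acquire))
            return true;
    return false;
}

/* Compiler-visible sink: the value counts as used, so the computation
   feeding it cannot be removed as dead code. Emits no instruction. */
static inline void keep(uint64_t v)
{
    __asm__ volatile("" : : "r"(v) : "memory");
}

static uint64_t checksum_work(const uint8_t *p, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += p[i];
    keep(sum);
    return sum;
}
```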
- What exactly counts as 'observable behavior' in the standard?
- Why can an empty polling loop disappear, and what minimal change prevents it?
- How would you design a benchmark so the compiler cannot delete the work?
Explain pointer provenance and object lifetime. Why can `realloc` invalidate aliases, and why is comparing or dereferencing a pointer to a freed object UB even if the bit pattern looks valid? (staff)
A pointer value in C is not just an address; it carries provenance - an association with a particular object whose storage duration bounds the pointer's validity. The optimizer reasons about provenance to prove two pointers cannot alias even when their numeric addresses could coincide.
A successful realloc ends the lifetime of the old object, whether or not the block moves; every existing pointer into it (including interior pointers and restrict-derived ones) becomes indeterminate. Using such a pointer, even just to *compare* it and not only to dereference it, is undefined. This bites NIC code that caches interior pointers into a buffer that later grows.
struct buf { uint8_t *base; size_t len, cap; };
uint8_t *hdr = b->base + off; /* interior pointer */
uint8_t *nb = realloc(b->base, b->cap * 2);
if (!nb)
    return -1; /* old block, and hdr, are still valid on failure */
b->base = nb;
b->cap *= 2;
/* hdr is now indeterminate even if realloc returned the same address */
hdr = b->base + off; /* must re-derive from the new base */
Two consequences engineers miss:
- A pointer one-past-the-end of one object can compare bit-equal to a pointer into an adjacent object, yet the standard treats them as distinct provenances; comparisons across unrelated objects are unspecified or undefined.
- Round-tripping a pointer through `uintptr_t` and back does not necessarily restore provenance under aggressive analysis (this is the subject of WG14's provenance work, N2263/N3005).
The practical rule: after any reallocation or free, treat all derived pointers as dead and re-derive them from the live base; never compare or hash raw pointer values across object boundaries.
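One way to follow that rule is to keep offsets rather than pointers across any call that may grow the buffer. The sketch below uses hypothetical helpers `buf_reserve` and `buf_at` on the same `struct buf` shape, with an arbitrary initial capacity of 64 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct buf { uint8_t *base; size_t len, cap; };

/* Grow to at least need bytes. On failure the old block stays valid. */
static int buf_reserve(struct buf *b, size_t need)
{
    if (need <= b->cap)
        return 0;
    size_t cap = b->cap ? b->cap : 64;
    while (cap < need) {
        if (cap > SIZE_MAX / 2)
            return -1;                /* doubling would overflow */
        cap *= 2;
    }
    uint8_t *nb = realloc(b->base, cap);
    if (!nb)
        return -1;
    b->base = nb;    /* every pointer derived from the old base is dead now */
    b->cap = cap;
    return 0;
}

/* Re-derive a pointer from the live base each time it is needed. */
static uint8_t *buf_at(const struct buf *b, size_t off)
{
    assert(off < b->len);
    return b->base + off;
}
```

Callers store `hdr_off` instead of `hdr`, so there is nothing stale to fix up after a growth.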
- Why does `free(p); if (p == q) ...` risk UB even without dereferencing?
- How does provenance let the compiler keep a value in a register across an opaque call?
- What is the practical impact of the WG14 provenance proposal on driver code?
Give the precise C semantics of `volatile`. What does a `volatile` access guarantee and not guarantee, and when is it actually the right tool? (staff)
volatile means every access to the object is an observable side effect: the compiler must emit exactly the loads and stores the abstract machine specifies, in program order *with respect to other volatile and I/O side effects*, and may not elide, duplicate, fuse, or reorder them past each other. That is its entire job.
What it does not provide:
- Atomicity. A `volatile uint64_t` on a 32-bit target still tears.
- Inter-thread ordering or visibility. It is not a memory barrier; non-volatile accesses around it, and the CPU's own reordering, are unconstrained. Two threads racing on a `volatile` is still a data race and UB.
- Cache coherency. It does nothing about DMA visibility.
The three portable, correct uses are exactly:
- MMIO: the address has side effects the compiler cannot see (`volatile uint32_t *reg`), usually wrapped in accessors that add the needed CPU/device barriers.
- `volatile sig_atomic_t` for a flag shared with a signal handler in the *same* thread.
- Locals modified between `setjmp` and `longjmp`: without `volatile`, their value after `longjmp` is indeterminate.
volatile uint32_t *status = ioremap_reg(BAR0 + STATUS);
while (!(*status & DONE)) /* re-read each iteration, not hoisted */
    cpu_relax();
Note the loop is correct against the *device* updating the register, but *status carries no ordering for any *other* memory; you still need a barrier before consuming data the DONE bit gates.
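The signal-handler use is the one that can be exercised without hardware. In this sketch `run_until_stopped` is a hypothetical work loop, and `raise(SIGINT)` stands in for an asynchronous interrupt; with `raise`, the handler runs before the call returns.

```c
#include <assert.h>
#include <signal.h>

static volatile sig_atomic_t stop_requested;

static void on_sigint(int signo)
{
    (void)signo;
    stop_requested = 1;   /* a volatile sig_atomic_t store is handler-safe */
}

/* Runs work items until the handler sets the flag or max_items is reached;
   returns the number of items completed, or -1 if the handler cannot be set. */
static int run_until_stopped(int max_items)
{
    assert(max_items >= 0);
    int done = 0;
    stop_requested = 0;
    if (signal(SIGINT, on_sigint) == SIG_ERR)
        return -1;
    while (!stop_requested && done < max_items) {  /* flag re-read each pass */
        done++;
        if (done == 3)
            raise(SIGINT);    /* stand-in for an asynchronous Ctrl-C */
    }
    signal(SIGINT, SIG_DFL);
    return done;
}
```

Without `volatile` the compiler could read `stop_requested` once before the loop, since nothing in the loop body visibly modifies it.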
- Why is `volatile` insufficient for a flag shared between two threads, but acceptable for a signal handler?
- What ordering does a `volatile` load give relative to a neighboring non-volatile store?
- Why do kernels often wrap MMIO in `readl`/`writel` rather than bare `volatile` dereferences?
Distinguish a compiler barrier from a CPU memory barrier in C. Show the minimal primitives and where each is necessary in a driver. (staff)
They constrain different reordering layers:
- A compiler barrier stops the compiler from moving memory accesses across a point, but emits no instruction; the CPU may still reorder. In GCC/Clang: `asm volatile("" ::: "memory")`, or portably `atomic_signal_fence(memory_order_seq_cst)`.
- A CPU memory barrier emits a fence (e.g. `mfence`, `dmb ish`) that constrains the *hardware's* ordering and implies a compiler barrier too. Portably: `atomic_thread_fence(memory_order_*)`.
Which you need depends on who the other observer is:
- Two threads on cache-coherent CPUs: a CPU barrier (or, better, acquire/release atomics) is required; a compiler barrier alone is not enough on weakly-ordered hardware like Arm.
- A thread versus a signal handler on the *same* core: only a compiler barrier is needed, because the core sees its own stores in order, which is exactly what `atomic_signal_fence` is for.
- A thread versus a DMA device: you need the platform DMA barrier, not a plain C fence. `atomic_thread_fence` orders accesses for the C memory model and other CPUs, but the DMA path may require `dma_wmb()` semantics that also account for write-combining and device-visible ordering.
write_descriptor(&ring[i], desc); /* fill descriptor */
dma_wmb(); /* device sees descriptor before owner bit */
ring[i].flags = cpu_to_le32(OWN_DEVICE);
wmb(); /* order vs MMIO doorbell */
writel(i, doorbell); /* kick */
The staff point: x86's strong (TSO) model hides most of this in testing - it only reorders store-then-load - so a missing barrier often passes on x86 and fails on Arm. Audit by reasoning about the model, not by what reproduces.
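For the thread-to-thread case, the portable C11 form is a release store paired with an acquire load. The single-slot `struct mailbox` below is a hypothetical illustration of that pairing for one producer and one consumer; it is not a substitute for the platform DMA barriers in the driver snippet above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct mailbox {
    uint32_t payload;        /* plain data, published by the flag below */
    atomic_bool full;
};

static void mailbox_put(struct mailbox *m, uint32_t v)
{
    assert(!atomic_load_explicit(&m->full, memory_order_acquire));
    m->payload = v;
    /* Release: the payload store cannot move below the flag store,
       for the compiler and for the CPU. */
    atomic_store_explicit(&m->full, true, memory_order_release);
}

static bool mailbox_take(struct mailbox *m, uint32_t *out)
{
    /* Acquire: the payload load cannot be hoisted above the flag load. */
    if (!atomic_load_explicit(&m->full, memory_order_acquire))
        return false;
    *out = m->payload;
    atomic_store_explicit(&m->full, false, memory_order_release);
    return true;
}

/* Observer is a signal handler on the same core: compiler ordering only. */
static inline void compiler_barrier(void)
{
    atomic_signal_fence(memory_order_seq_cst);
}
```

On x86 the release store and acquire load compile to plain moves, which is why a version with the ordering left out tends to pass there and fail on Arm.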
- Why is `atomic_signal_fence` enough for a signal handler but not for a second thread?
- On x86, which single reordering does the hardware still allow, and which barrier kills it?
- Why might `dma_wmb()` differ from a generic `atomic_thread_fence(release)` on a given arch?