DMA and IOMMU
Probes whether a low-level engineer can reason about DMA mappings, cache coherency, IOVA translation, barriers, and real Linux driver failure modes.
DMA to an IOVA: the IOMMU translates and bounds-checks it, or blocks it.
Contrast coherent DMA allocations with streaming DMA mappings in a NIC driver. Which would you use for descriptors, packet buffers, and completion queues? (senior)
Coherent (consistent) DMA memory from dma_alloc_coherent() gives the CPU a virtual address and the device a dma_addr_t where both sides see each other's writes without explicit cache maintenance. On a non-coherent architecture the kernel typically backs this with uncached or write-combining memory and may insert barriers; on coherent x86 it is normal write-back memory. It is the natural fit for descriptor rings, completion queues, doorbell/event records, and small control structures that are touched continuously by both sides.
Streaming mappings from dma_map_single(), dma_map_page(), or dma_map_sg() are for buffers handed to the device for a bounded DMA operation and later unmapped or synced. They are the right fit for packet payload buffers because they avoid allocating all payload memory from scarce coherent pools, let the DMA API manage IOMMU/cache state per transfer, and can map memory the driver already has (e.g. an skb fragment, a page-pool page).
The tradeoff is cost and semantics. Coherent memory can be scarce, slower to access if uncached, and page-granular. Streaming mappings require correct direction, lifetime, and sync calls, and they pay a per-transfer map/unmap cost (an IOTLB/PTE update under an IOMMU). A high-rate NIC keeps rings coherent and payload buffers streaming, often via page_pool so DMA mappings are set up once and recycled rather than mapped per packet.
- Why not put all RX packet buffers in coherent memory?
- What does `dma_addr_t` represent?
- When would a coherent allocation still need a write barrier?
Even on coherent memory you write descriptor fields then ring a doorbell. Why is a barrier still needed, and which one? (staff)
Coherence and ordering are different guarantees. Cache coherence means the CPU and device eventually agree on the contents of a line without manual flush/invalidate; it says nothing about the order in which stores become visible. The compiler can reorder independent stores, and the CPU's store buffer can drain them out of order, so the device could fetch a descriptor whose ownership/valid field is set while an earlier field (address, length) is still in the store buffer.
The correct barrier is dma_wmb() between filling the descriptor and the store that publishes it (the valid/ownership bit or the doorbell). dma_wmb() orders prior writes to DMA-coherent memory as observed by the device. It is deliberately lighter than wmb(): on x86 it is just a compiler barrier (barrier()), since TSO already keeps store order; on ARM it emits dmb(oshst) (outer-shareable store barrier) rather than the heavier dsb() that wmb() uses. Pair it with dma_rmb() on the consume side, between reading the ownership bit and reading the rest of the descriptor.
/* publish a descriptor in coherent ring memory */
desc->addr = cpu_to_le64(dma);
desc->len = cpu_to_le16(len);
dma_wmb(); /* addr/len visible before VALID */
desc->flags = cpu_to_le16(DESC_VALID);
Missing the barrier is the canonical 'works on x86, fails on ARM' DMA bug: x86's strong store ordering hides it, and only a weakly ordered CPU exposes the device reading a valid-but-incomplete descriptor.
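The publish/consume pairing can be exercised in userspace. The sketch below is a model, not kernel code: C11 release and acquire fences stand in for dma_wmb() and dma_rmb(), and the names toy_desc, toy_publish, and toy_consume are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_DESC_VALID 0x1u

/* Hypothetical descriptor layout; a real NIC defines its own. */
struct toy_desc {
    uint64_t addr;
    uint16_t len;
    _Atomic uint16_t flags;   /* ownership/valid word, written last */
};

/* Producer: fill the payload fields, fence, then publish VALID. */
static void toy_publish(struct toy_desc *d, uint64_t addr, uint16_t len)
{
    assert(d != NULL);
    d->addr = addr;
    d->len = len;
    atomic_thread_fence(memory_order_release);   /* stand-in for dma_wmb() */
    atomic_store_explicit(&d->flags, TOY_DESC_VALID, memory_order_relaxed);
}

/* Consumer: check VALID, fence, and only then read the other fields. */
static bool toy_consume(struct toy_desc *d, uint64_t *addr, uint16_t *len)
{
    assert(d != NULL && addr != NULL && len != NULL);
    if (!(atomic_load_explicit(&d->flags, memory_order_relaxed) & TOY_DESC_VALID))
        return false;
    atomic_thread_fence(memory_order_acquire);   /* stand-in for dma_rmb() */
    *addr = d->addr;
    *len = d->len;
    return true;
}
```

The shape is the point: the fence sits between the payload fields and the ownership word on both sides. Dropping either fence leaves the single-threaded behavior unchanged, which is exactly why the bug survives testing on strongly ordered hardware.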
- Why is `dma_wmb()` cheaper than `wmb()`?
- What does this compile to on x86 versus ARM?
- Where does `writel()` already include a barrier?
Explain the lifecycle of an RX buffer mapped with `dma_map_single()` and reused across packets. Where do `dma_sync_single_for_cpu()` and `dma_sync_single_for_device()` belong? (senior)
For an RX buffer the direction is normally DMA_FROM_DEVICE. The driver allocates CPU memory, maps it with dma_map_single(dev, buf, len, DMA_FROM_DEVICE), programs the returned dma_addr_t into the RX descriptor, and gives ownership to the NIC.
When the NIC completes a packet, the CPU must not read the buffer until the device has published completion and the mapping has been synchronized for CPU ownership (which, on a non-coherent system, invalidates any stale CPU cache lines so the CPU sees the device's data, not stale cache). After processing, if the same buffer is reposted, the driver synchronizes it back for the device before handing ownership over.
/* Buffer was mapped once with DMA_FROM_DEVICE and posted to RX. */
if (rx_done(desc)) {
    dma_sync_single_for_cpu(dev, dma, len, DMA_FROM_DEVICE);
    consume_packet(buf);
    dma_sync_single_for_device(dev, dma, len, DMA_FROM_DEVICE);
    repost_rx_desc(desc, dma, len);
}
If the buffer will not be reused, dma_unmap_single() after completion is enough; do not keep using the DMA address afterward. The gotchas are direction and partial sync: using DMA_TO_DEVICE for RX, or reading before the for_cpu sync, can work on coherent x86 and corrupt on non-coherent ARM or wherever the DMA API does real cache maintenance. Also, the buffer the device writes into between for_device and the next completion is owned by the device: the CPU must not touch it, or the for_cpu invalidate can throw away the CPU's dirty line (or the CPU reads stale cache).
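To see why the for_cpu sync matters, here is a small userspace model of a non-coherent system. It is a sketch under simplifying assumptions (one buffer, one "cache" that is either wholly valid or wholly invalid); toy_rx_buf and the toy_* helpers are hypothetical names, and toy_sync_for_cpu only models the invalidate that dma_sync_single_for_cpu() would perform.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_BUF_LEN 64

/* Userspace model, not kernel code. */
struct toy_rx_buf {
    uint8_t dram[TOY_BUF_LEN];    /* what the device reads and writes */
    uint8_t cache[TOY_BUF_LEN];   /* the CPU's cached copy */
    bool cache_valid;             /* CPU holds (possibly stale) lines */
};

/* CPU load: served from cache if present, otherwise filled from DRAM. */
static uint8_t toy_cpu_read(struct toy_rx_buf *b, size_t i)
{
    assert(b != NULL && i < TOY_BUF_LEN);
    if (!b->cache_valid) {
        memcpy(b->cache, b->dram, TOY_BUF_LEN);
        b->cache_valid = true;
    }
    return b->cache[i];
}

/* Device DMA write: goes to DRAM and never touches the CPU cache. */
static void toy_device_write(struct toy_rx_buf *b, size_t i, uint8_t v)
{
    assert(b != NULL && i < TOY_BUF_LEN);
    b->dram[i] = v;
}

/* Models the invalidate performed when syncing for CPU ownership. */
static void toy_sync_for_cpu(struct toy_rx_buf *b)
{
    assert(b != NULL);
    b->cache_valid = false;
}
```

If the CPU reads the buffer once, the device then writes a packet, and the CPU reads again without toy_sync_for_cpu(), the second read returns the old cached byte. After the sync, the read refills from DRAM and sees the device's data.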
- What changes if the CPU writes packet headers before TX?
- Can `dma_sync_single_for_cpu()` be skipped on x86?
- What if only the first 128 bytes are read?
What are CPU virtual, CPU physical, bus, and IOVA addresses? Why is using `virt_to_phys()` in a PCIe NIC driver usually a bug? (senior)
The CPU uses virtual addresses; the MMU translates those to CPU physical addresses. A device performing DMA uses a bus address, which under an IOMMU is an IOVA (I/O virtual address) that the IOMMU translates to a physical address via its own page tables. These spaces can coincide on simple systems, but portable Linux drivers must assume they do not.
dma_map_single() takes a CPU virtual address and returns a dma_addr_t: the address the device should put on the bus. That value may be an IOVA allocated from the device's IOMMU domain, may be constrained by the device DMA mask, and may point at a bounce buffer (SWIOTLB) rather than the original page.
virt_to_phys() returns the raw physical address and bypasses all of that: it ignores IOMMU mappings (the device's IOVA space may have no mapping for that physical page, so the DMA faults), cache maintenance, the DMA mask, memory-encryption/confidential-computing address bits (SEV/TDX set a shared/encryption bit the device-visible address must reflect), highmem, and SWIOTLB. It can appear to work on a developer's IOMMU-off x86 box and then fault or corrupt memory on a production platform with the IOMMU on or memory encryption active. The only correct exceptions are narrow architecture-specific code that owns the translation model; a normal PCIe NIC driver must use the DMA API.
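A toy single-level IOMMU makes the failure concrete. This is a userspace sketch, not how any real IOMMU driver is written: the table size, the slot allocator, and the names toy_iommu, toy_iommu_map, and toy_iommu_translate are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12
#define TOY_PAGE_MASK  ((UINT64_C(1) << TOY_PAGE_SHIFT) - 1)
#define TOY_NSLOTS     8
#define TOY_MAP_ERROR  UINT64_MAX

/* Userspace model: one IOVA page per slot. */
struct toy_iommu {
    uint64_t pfn[TOY_NSLOTS];
    bool present[TOY_NSLOTS];
};

/*
 * Map the page containing 'phys' and return the IOVA of that byte.
 * Slot 0 is never handed out, so IOVA 0 always faults.
 */
static uint64_t toy_iommu_map(struct toy_iommu *mmu, uint64_t phys)
{
    assert(mmu != NULL);
    for (uint64_t slot = 1; slot < TOY_NSLOTS; slot++) {
        if (!mmu->present[slot]) {
            mmu->present[slot] = true;
            mmu->pfn[slot] = phys >> TOY_PAGE_SHIFT;
            return (slot << TOY_PAGE_SHIFT) | (phys & TOY_PAGE_MASK);
        }
    }
    return TOY_MAP_ERROR;
}

/* Device-side access: translate the IOVA or report a fault. */
static bool toy_iommu_translate(const struct toy_iommu *mmu,
                                uint64_t iova, uint64_t *phys)
{
    assert(mmu != NULL && phys != NULL);
    uint64_t slot = iova >> TOY_PAGE_SHIFT;
    if (slot >= TOY_NSLOTS || !mmu->present[slot])
        return false;
    *phys = (mmu->pfn[slot] << TOY_PAGE_SHIFT) | (iova & TOY_PAGE_MASK);
    return true;
}
```

The address returned by toy_iommu_map() translates back to the original physical byte. Feeding the raw physical address to toy_iommu_translate(), which is what a descriptor built from virt_to_phys() would do, faults because nothing is mapped at that IOVA.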
- What address is stored in a hardware descriptor?
- How does an IOMMU change the value?
- Why does memory encryption break `virt_to_phys()` for DMA?
What exactly does the DMA direction argument do, and what failures come from choosing `DMA_BIDIRECTIONAL` everywhere? (senior)
The direction tells the DMA API who will write the buffer and therefore what cache maintenance and IOMMU permissions are required:
- DMA_TO_DEVICE: CPU produced the data, device reads it. On map/sync, dirty CPU cache lines are cleaned (written back) so the device reads current data. Typical TX payload, TX descriptor fetch. IOMMU maps read-only for the device.
- DMA_FROM_DEVICE: device writes, CPU later reads. On map/sync-for-device the region is invalidated; on sync-for-cpu it is invalidated again so the CPU does not see stale cache. Typical RX payload. IOMMU maps write-only for the device.
- DMA_BIDIRECTIONAL: both may read and write during the mapping; the API must clean and invalidate, and the IOMMU maps read-write.
- DMA_NONE: debugging sentinel, not a real direction.
Using DMA_BIDIRECTIONAL everywhere can mask bugs but costs performance (extra clean+invalidate cache work on non-coherent systems) and weakens IOMMU permissions, so a buggy device can write where it should only read. Worse, it hides ownership errors until a non-coherent or strict-IOMMU platform changes behavior. The senior answer is the narrowest correct direction, and to structure ownership so a buffer is never simultaneously writable by CPU and device. If the CPU modifies a TX buffer after dma_map_single(..., DMA_TO_DEVICE), it must dma_sync_single_for_device() again before the NIC reads the modified bytes.
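The direction rules above can be written down as a lookup. This is only a restatement of the list as data, simplified to one cache action per side; the enum, struct, and function names are hypothetical and do not correspond to kernel types.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's DMA directions. */
enum toy_dma_dir { TOY_TO_DEVICE, TOY_FROM_DEVICE, TOY_BIDIRECTIONAL };

struct toy_dma_policy {
    bool clean_for_device;     /* write back dirty CPU lines before the device reads */
    bool invalidate_for_cpu;   /* drop CPU lines before the CPU reads device data */
    bool device_may_read;      /* IOMMU read permission */
    bool device_may_write;     /* IOMMU write permission */
};

/* Simplified model of what each direction asks for. */
static struct toy_dma_policy toy_policy(enum toy_dma_dir dir)
{
    switch (dir) {
    case TOY_TO_DEVICE:
        return (struct toy_dma_policy){ .clean_for_device = true,
                                        .device_may_read = true };
    case TOY_FROM_DEVICE:
        return (struct toy_dma_policy){ .invalidate_for_cpu = true,
                                        .device_may_write = true };
    case TOY_BIDIRECTIONAL:
        return (struct toy_dma_policy){ true, true, true, true };
    }
    assert(!"invalid direction");
    return (struct toy_dma_policy){ false, false, false, false };
}
```

Seen this way, the cost of using the bidirectional setting everywhere is visible at a glance: every flag is set, so the mapping pays for both cache operations and grants the device both permissions.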
- What direction is an RX completion ring?
- What if the NIC writes checksum status into an RX buffer header?
- How would DMA API debugging catch this?
Describe cache maintenance on non-coherent architectures for RX and TX. Why can a driver pass x86 testing and fail on ARM? (staff)
On coherent x86-like systems, hardware keeps CPU caches and DMA coherent for normal cached memory, though ordering barriers still matter. On non-coherent systems the DMA API must clean dirty CPU cache lines before device reads and invalidate CPU cache lines before the CPU reads device-written data.
For TX, the CPU writes the packet then maps or syncs DMA_TO_DEVICE; the clean pushes the bytes from cache to memory so the NIC fetches current data. For RX, the driver syncs for device before posting; after completion it syncs for CPU, invalidating stale lines so the CPU sees the NIC's writes rather than old cache contents.
The subtle bugs are alignment and sharing:
- If two unrelated objects share a cache line with a streaming DMA buffer, the invalidate on the DMA region can discard dirty CPU data in the same line (or a CPU write can be lost). This is why DMA buffers should be cache-line aligned and sized; the kernel exposes `ARCH_DMA_MINALIGN`/`dma_get_cache_alignment()`.
- Reusing buffers without `dma_sync_single_for_device()` leaves the device writing into a line the CPU still has dirty, so the CPU's later writeback can clobber device data.
- Reading a descriptor's valid bit before syncing/`dma_rmb()` for the rest exposes stale fields.
This is why the DMA API forbids casual DMA to arbitrary stack or vmalloc memory: stack memory shares cache lines with other locals and may not be physically contiguous, and vmalloc memory is virtually but not physically contiguous so a single dma_addr_t cannot describe it.
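The cache-line sharing hazard reduces to simple arithmetic. The helpers below are a userspace sketch with hypothetical names (toy_dma_pad, toy_shares_line); the line size is passed in as a parameter, where a driver would take it from the platform.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Round len up to a whole number of cache lines (line must be a power of two). */
static size_t toy_dma_pad(size_t len, size_t line)
{
    assert(line != 0 && (line & (line - 1)) == 0);
    return (len + line - 1) & ~(line - 1);
}

/* True if byte ranges [a, a+alen) and [b, b+blen) touch a common cache line. */
static bool toy_shares_line(uintptr_t a, size_t alen,
                            uintptr_t b, size_t blen, size_t line)
{
    assert(alen > 0 && blen > 0 && line > 0);
    uintptr_t a_first = a / line, a_last = (a + alen - 1) / line;
    uintptr_t b_first = b / line, b_last = (b + blen - 1) / line;
    return a_first <= b_last && b_first <= a_last;
}
```

With 64-byte lines, a 100-byte DMA buffer at offset 0 followed directly by an 8-byte field at offset 100 shares line 1 with that field. Padding the buffer to toy_dma_pad(100, 64), which is 128 bytes, pushes the neighbor onto its own line.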
- What is false sharing with DMA buffers?
- Why are stack and vmalloc buffers bad DMA targets?
- What test hardware would expose this fastest?
How do scatter-gather DMA mappings work, and what is the common bug around `nents` versus the return value of `dma_map_sg()`? (senior)
dma_map_sg() maps an input scatterlist of nents entries and returns the number of DMA segments the device should be programmed with. That returned count can be smaller than nents because the DMA layer or IOMMU can coalesce physically or IOVA-adjacent entries into one segment. The driver must program hardware by iterating the mapped count with for_each_sg(sgl, sg, mapped, i) and reading sg_dma_address(sg)/sg_dma_len(sg) (the DMA-side fields), not the original sg->page/sg->length.
The unmap call is the classic trap: dma_unmap_sg() must be passed the original nents, not the mapped count. It re-derives the mapped segments internally; passing the smaller mapped count under-unmaps and leaks IOVA/IOMMU mappings (eventually exhausting the aperture) or, on bounce paths, fails to copy back.
Other senior concerns: a zero return means failure and must be checked; the device has segment-count, max-segment-size, and boundary/alignment limits (dma_set_max_seg_size(), dma_set_seg_boundary()) that constrain how the SG list may be built; and a NIC TX path must decide whether to linearize (skb_linearize()), respect TSO/GSO limits, or stop the queue when the mapped segment count exceeds the hardware descriptor budget.
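The nents-versus-mapped distinction is easier to hold onto with a model of coalescing. The sketch below is userspace code, not the kernel's scatterlist: it assumes an identity mapping (DMA address equals physical address) and merges only entries that are exactly adjacent. The toy_sg layout and toy_map_sg are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace model of an SG entry with separate CPU-side and DMA-side fields. */
struct toy_sg {
    uint64_t phys;         /* CPU-side view: where the chunk lives */
    uint32_t length;       /* CPU-side view: chunk length */
    uint64_t dma_address;  /* device-side view, valid for the first 'mapped' entries */
    uint32_t dma_length;
};

/*
 * Fill the DMA-side fields, merging adjacent chunks.
 * Returns the number of DMA segments, which may be less than nents.
 */
static int toy_map_sg(struct toy_sg *sg, int nents)
{
    assert(sg != NULL && nents > 0);
    int mapped = 0;
    for (int i = 0; i < nents; i++) {
        struct toy_sg *prev = mapped ? &sg[mapped - 1] : NULL;
        if (prev && prev->dma_address + prev->dma_length == sg[i].phys) {
            prev->dma_length += sg[i].length;       /* coalesce */
        } else {
            sg[mapped].dma_address = sg[i].phys;    /* start a new segment */
            sg[mapped].dma_length = sg[i].length;
            mapped++;
        }
    }
    return mapped;
}
```

For three input entries where the first two are adjacent, the function returns 2. Hardware descriptors are built from the first two dma_address/dma_length pairs, while the original count of 3 is what the caller keeps for the eventual unmap.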
- What fields should hardware descriptors use from an SG entry?
- What do you do if mapped segments exceed hardware limits?
- Why can an IOMMU merge entries?
What is SWIOTLB, when do bounce buffers appear, and why does it matter to a high-rate NIC? (senior)
SWIOTLB is Linux's software bounce-buffer mechanism, a pool of low physical memory used when a device cannot DMA directly to the target address or platform policy requires bouncing. It appears because of a limited DMA mask (device can't reach the buffer's physical address), memory encryption / confidential computing (data must transit a shared, unencrypted bounce region), restricted-DMA windows, virtualization, or lack of a usable IOMMU mapping.
On map for DMA_TO_DEVICE, data is copied into a bounce buffer the device can read. On DMA_FROM_DEVICE, the device writes the bounce buffer and the data is copied back to the real buffer on sync/unmap. That memcpy on every packet is poison for high packet rates, and the bounce pool is finite, so it can be exhausted under load, causing mapping failures.
A senior engineer checks dma_mapping_error() on every map, watches dmesg for 'swiotlb buffer is full' messages and the swiotlb counters, validates the mask with dma_set_mask_and_coherent(dev, DMA_BIT_MASK(64)), and measures map/unmap cost separately from NIC hardware latency. The classic silent regression: the device supports 64-bit DMA but the driver sets a 32-bit mask (or the platform places memory above 4 GB), so everything quietly routes through SWIOTLB and throughput falls off a cliff while the NIC itself looks fine. You can size the pool with the swiotlb= boot parameter, but the real fix is correct masks and/or an IOMMU.
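The mask-driven case reduces to one comparison: does the last byte of the buffer fit under the device's DMA mask? The sketch below is a userspace illustration of that check only. TOY_DMA_BIT_MASK mirrors the shape of the kernel's DMA_BIT_MASK() macro, and toy_needs_bounce is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the shape of the kernel's DMA_BIT_MASK(n) for 1 <= n <= 64. */
#define TOY_DMA_BIT_MASK(n) \
    (((n) == 64) ? ~UINT64_C(0) : ((UINT64_C(1) << (n)) - 1))

/* True if any byte of [phys, phys + len) lies above what the device can address. */
static bool toy_needs_bounce(uint64_t phys, size_t len, uint64_t mask)
{
    assert(len > 0);
    assert(phys <= UINT64_MAX - (len - 1));   /* no wraparound in this model */
    return phys + (len - 1) > mask;
}
```

A 1500-byte buffer at the 4 GB boundary needs bouncing under a 32-bit mask and does not under a 64-bit mask, which is the silent regression described above when a 64-bit-capable device is left with a 32-bit mask.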
- How would you detect bounce buffering in production?
- What happens under SWIOTLB exhaustion?
- Why does memory encryption force bouncing?
Before ringing a NIC doorbell, what ordering is required between descriptor writes in memory and the MMIO doorbell write? Where does `dma_wmb()` fit? (staff)
The NIC must not observe the doorbell before it can observe the descriptors and packet metadata the doorbell publishes. There are two ordering domains: CPU stores to DMA-visible memory and CPU stores to MMIO.
A typical pattern:
/* Fill the DMA-visible TX descriptor fields first. */
desc->addr = cpu_to_le64(dma);
desc->len = cpu_to_le16(len);
/* addr/len must be visible to the device before VALID. */
dma_wmb();
desc->flags = cpu_to_le16(TX_VALID);
/* writel() orders the descriptor stores before the MMIO doorbell. */
writel(new_tail, txq->doorbell);
dma_wmb() orders writes to DMA memory as seen by the device, so it belongs between the field stores and the store that sets TX_VALID; a device that polls or prefetches descriptors then never sees VALID ahead of addr/len. writel() is an ordered MMIO accessor: it already orders prior writes to coherent memory before the MMIO store, so no further barrier is needed between the descriptor and the doorbell. writel_relaxed() drops that guarantee, and if you use it you must supply the ordering yourself with a wmb() before it. On x86 strong ordering hides a missing barrier, but portable NIC drivers must express the contract explicitly.
The failure mode is brutal: the NIC fetches a descriptor after seeing the new tail and reads an old address, old length, or a VALID bit set before the rest of the descriptor landed. It DMAs from the wrong memory or transmits garbage. These bugs reproduce only on weakly ordered CPUs (ARM, POWER) or with write-combining doorbell mappings where the doorbell can drain ahead of unfenced WB stores.
- Why is `smp_wmb()` not the right barrier here?
- What changes with a write-combined userspace doorbell?
- Would you put the valid bit first or last?
How does an IOMMU protect the system, and what performance costs show up in a low-latency NIC datapath? (staff)
An IOMMU translates device-visible IOVAs to physical memory and enforces per-device/per-domain permissions. Intel VT-d and ARM SMMU are the common examples. The security value is containment: a buggy or malicious NIC (or a VF) can only DMA to pages mapped into its domain, which is the foundation of VFIO, SR-IOV passthrough, and multi-tenant isolation, and a defense against malicious-DMA attacks.
The costs are translation, invalidation, and map/unmap overhead. High packet-rate paths hit IOTLB misses (a page-table walk per miss), expensive invalidations on unmap, lock and IOVA-allocator contention in the DMA/IOMMU layer (the iova rbtree / per-CPU magazine caches), and latency spikes when mappings are created or torn down too often. Unmap is frequently the worst offender because, in strict mode, it must synchronously flush the IOTLB entry before the IOVA can be reused, and that flush is a round trip to the IOMMU.
Mitigations: long-lived mappings for rings and buffer pools (map once, reuse), page_pool-style recycling so payload buffers are not mapped per packet, batching unmaps, aligning/huge-page-backing buffers to reduce mapping count, and choosing the right invalidation mode for the platform's security bar. A staff answer refuses to just 'disable the IOMMU': that may be unacceptable for isolation, and on confidential-computing or passthrough platforms it is not even an option.
- What is an IOTLB miss and what does it cost?
- Why is unmap often more expensive than map?
- How does SR-IOV change the IOMMU story?
Explain strict versus lazy/deferred IOMMU invalidation, and passthrough mode. What is the security and performance tradeoff, and how does it interact with a NIC's map/unmap churn? (staff)
When an IOVA mapping is torn down, the stale IOTLB entry must be invalidated before that IOVA is reused, or a device could still DMA through the old translation. The mode controls when:
- Strict (`iommu.strict=1`): every `dma_unmap_*()` synchronously flushes the IOTLB before returning. Maximum isolation: the moment the driver unmaps, the device can no longer reach that memory. Cost: a synchronous IOMMU invalidation round trip on the unmap hot path, which is exactly what hurts a high-rate NIC doing per-packet map/unmap. Whether strict is the default depends on the architecture and kernel configuration (`CONFIG_IOMMU_DEFAULT_DMA_STRICT` versus `CONFIG_IOMMU_DEFAULT_DMA_LAZY`); x86 kernels typically default to lazy.
- Lazy / deferred (`iommu.strict=0`, also `CONFIG_IOMMU_DEFAULT_DMA_LAZY`): unmaps free the IOVA into a per-CPU flush queue and the IOTLB is flushed in batches (and the IOVA is only handed back after the batched flush). Far less per-unmap overhead and much better throughput, but there is a window after unmap where the device can still DMA to the just-freed page using the stale cached translation, weakening isolation against a buggy/malicious device.
- Passthrough (`iommu.passthrough=1` / identity domain): the device uses an identity (1:1) map with no per-DMA translation; effectively no IOTLB churn and near-native performance, but no IOMMU protection for that device at all (it can DMA anywhere). Kernel DMA may then fall back to SWIOTLB only where a mask requires it.
The NIC interaction: per-packet streaming maps make strict mode's synchronous unmap flush a real bottleneck, which is why high-rate drivers prefer map-once-and-recycle (page_pool) so the mode barely matters, or run lazy when the isolation bar allows. The principled answer is that the choice is a security policy, not a perf knob: lazy/passthrough trade device isolation for speed, and that is unacceptable for untrusted VFs, multi-tenant, or confidential-computing platforms where strict (or at least non-passthrough) is required.
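The strict/lazy difference can be modeled with three bits per IOVA: a page-table entry, a cached IOTLB entry, and a "queued for flush" mark. This is a deliberately tiny userspace sketch with hypothetical names (toy_domain and the toy_* functions); real flush queues are per-CPU and timer or threshold driven.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NIOVA 4

/* Userspace model of one IOMMU domain. */
struct toy_domain {
    bool strict;
    bool pte[TOY_NIOVA];     /* page-table entry present */
    bool iotlb[TOY_NIOVA];   /* translation cached in the IOTLB */
    bool queued[TOY_NIOVA];  /* unmapped, waiting for a batched flush */
};

static void toy_map(struct toy_domain *d, size_t iova)
{
    /* An IOVA must not be reused until its stale entry has been flushed. */
    assert(d != NULL && iova < TOY_NIOVA && !d->queued[iova]);
    d->pte[iova] = true;
}

/* A device access hits the IOTLB if cached, otherwise walks the page table. */
static bool toy_device_access(struct toy_domain *d, size_t iova)
{
    assert(d != NULL && iova < TOY_NIOVA);
    if (d->iotlb[iova])
        return true;
    if (!d->pte[iova])
        return false;             /* IOMMU fault */
    d->iotlb[iova] = true;
    return true;
}

/* Batched invalidation of everything queued by lazy unmaps. */
static void toy_flush_queue(struct toy_domain *d)
{
    assert(d != NULL);
    for (size_t i = 0; i < TOY_NIOVA; i++) {
        if (d->queued[i]) {
            d->iotlb[i] = false;
            d->queued[i] = false;
        }
    }
}

static void toy_unmap(struct toy_domain *d, size_t iova)
{
    assert(d != NULL && iova < TOY_NIOVA);
    d->pte[iova] = false;
    if (d->strict)
        d->iotlb[iova] = false;   /* synchronous invalidation */
    else
        d->queued[iova] = true;   /* deferred to the next batch */
}
```

In the lazy domain, a device access between toy_unmap() and toy_flush_queue() still succeeds through the cached translation; that is the danger window. In the strict domain the same access faults immediately after the unmap.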
- What is the danger window in lazy mode?
- Why does map-once-and-recycle make the mode almost irrelevant?
- When is passthrough unacceptable?
How does a modern NIC driver (page_pool) avoid per-packet DMA map/unmap on the RX path, and what sync is still required? (staff)
Per-packet dma_map_single()/dma_unmap_single() on RX is expensive: under an IOMMU each is an IOVA allocation and a PTE/IOTLB update, and unmap in strict mode is a synchronous flush. page_pool solves this by mapping pages once and recycling them. With PP_FLAG_DMA_MAP, the pool DMA-maps each page when it is allocated and keeps the mapping; pages are returned to a per-NAPI cache (and a ptr_ring) and re-handed to the device without unmapping. The DMA mapping lives for the lifetime of the page in the pool, so the costly map/unmap happens once per page, not once per packet.
What is still required is cache sync, not remapping. Because the buffer is DMA_FROM_DEVICE and reused, before the page is given back to the NIC for a fresh write the relevant range must be synced for the device. With PP_FLAG_DMA_SYNC_DEV the pool does this automatically via page_pool_dma_sync_for_device(), syncing only max_len bytes from offset rather than the whole page, which on non-coherent systems is a meaningful saving. On the consume side the driver still issues dma_rmb() after reading the descriptor's completion bit and syncs the received range for the CPU before reading the packet bytes. On coherent x86 these syncs are nearly free; on ARM they are real cache operations, so bounding the synced length matters.
The correctness points: the page must not be recycled to the device while the stack still references it (page_pool uses page refcounts / page_pool_put_page() to gate this), and the synced range must cover everything the device will write. XDP and skb paths share the same pool so a redirected/transmitted frame is recycled rather than freed and remapped.
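The map-once-and-recycle economics can be shown with a counter model. This is not the page_pool API: toy_pool, toy_pool_get, and toy_pool_put are hypothetical, and the pool is a fixed array where a page is "mapped" the first time it is used and only "synced" on later uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_POOL_PAGES 4

struct toy_page {
    bool mapped;   /* DMA mapping exists and is kept across recycles */
    bool in_use;   /* currently posted to the device or held by the stack */
};

/* Userspace model of a recycling RX page pool. */
struct toy_pool {
    struct toy_page pages[TOY_POOL_PAGES];
    unsigned long map_calls;   /* expensive: one per page lifetime */
    unsigned long sync_calls;  /* cheaper: one per use */
};

/* Get a page for the RX ring: map on first use, otherwise just sync. */
static struct toy_page *toy_pool_get(struct toy_pool *p)
{
    assert(p != NULL);
    for (size_t i = 0; i < TOY_POOL_PAGES; i++) {
        struct toy_page *pg = &p->pages[i];
        if (pg->in_use)
            continue;
        if (!pg->mapped) {
            pg->mapped = true;
            p->map_calls++;
        }
        p->sync_calls++;        /* sync-for-device before reposting */
        pg->in_use = true;
        return pg;
    }
    return NULL;                /* pool exhausted */
}

/* Recycle a page; the DMA mapping is deliberately kept. */
static void toy_pool_put(struct toy_page *pg)
{
    assert(pg != NULL && pg->in_use);
    pg->in_use = false;
}
```

Cycling 1000 buffers through a four-page pool costs four map operations and 1000 syncs in this model, instead of 1000 maps and 1000 unmaps for per-packet mapping.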
- Why sync only `max_len` instead of the whole page?
- What gates recycling a page back to the device?
- How does this change the cost of strict-mode IOMMU?
You see rare RX data corruption only with IOMMU enabled and high memory pressure. How would you debug the DMA side? (staff)
Start by making the failure observable and classifying it: stale old packet, partial new packet, wrong buffer, overrun, or descriptor/status mismatch. Then audit the DMA contract rather than staring only at the NIC.
Key checks:
- Every `dma_map_*()` result is checked with `dma_mapping_error()`.
- The driver never uses a DMA address after `dma_unmap_*()` (stale IOVA reuse).
- RX buffers are not recycled to the device before the stack is done with them.
- `dma_sync_single_for_cpu()`/`dma_sync_single_for_device()` are paired correctly for reused streaming mappings.
- SG unmap uses the original `nents`, not the mapped count.
- The DMA mask is correct and no unexpected SWIOTLB bouncing occurs under pressure.
- Descriptor ownership accesses use `dma_wmb()`/`dma_rmb()` as appropriate.
- IOMMU fault logs, AER logs, and NIC internal DMA error counters are collected with timestamps.
Why 'IOMMU on + memory pressure' is a strong hint: under lazy invalidation, an IOVA freed on unmap can be reassigned to a new buffer while the device still holds a stale cached translation, so a late or mis-tracked DMA lands on the wrong page; memory pressure accelerates IOVA reuse and buffer turnover, making the window hit. Switching to strict mode and seeing the corruption vanish points squarely at a use-after-unmap / lifetime bug rather than a cache-sync bug.
Useful experiments: enable DMA API debug (CONFIG_DMA_API_DEBUG), watch for IOMMU fault reports and boot in strict mode (for example intel_iommu=on with iommu.strict=1), poison buffers on ownership transitions, add generation counters to descriptors, force single-queue operation, disable page recycling, vary MTU/fragmentation, and compare strict vs lazy vs SWIOTLB-heavy configs. The goal is to prove whether corruption is stale cache visibility, stale IOVA reuse, lifetime misuse, or actual device overwrite.
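Poisoning and generation counters can be combined into a small ownership tracker. The sketch below is a userspace illustration of the idea, not driver code: toy_tracked_buf, the poison byte, and both helpers are hypothetical, and a real driver would carry the generation in a descriptor or completion field that the hardware echoes back.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_POISON 0xA5
#define TOY_RX_LEN 32

enum toy_owner { TOY_OWNER_CPU, TOY_OWNER_DEVICE };

/* Userspace model of an RX buffer with debug instrumentation. */
struct toy_tracked_buf {
    uint8_t data[TOY_RX_LEN];
    enum toy_owner owner;
    uint32_t generation;    /* bumped on every hand-off to the device */
};

/* CPU -> device: poison the payload so stale reads are recognizable. */
static bool toy_give_to_device(struct toy_tracked_buf *b)
{
    assert(b != NULL);
    if (b->owner != TOY_OWNER_CPU)
        return false;       /* double post: lifetime bug */
    memset(b->data, TOY_POISON, sizeof(b->data));
    b->generation++;
    b->owner = TOY_OWNER_DEVICE;
    return true;
}

/* device -> CPU: reject a completion that names the wrong generation. */
static bool toy_take_for_cpu(struct toy_tracked_buf *b, uint32_t completed_gen)
{
    assert(b != NULL);
    if (b->owner != TOY_OWNER_DEVICE || completed_gen != b->generation)
        return false;       /* stale or misrouted completion */
    b->owner = TOY_OWNER_CPU;
    return true;
}
```

A packet that still shows the poison pattern after a completion suggests the CPU is reading stale cache or the device wrote elsewhere, while a generation mismatch suggests a completion was matched to the wrong buffer. Each failure points at a different class of bug from the list above.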
- What would stale IOVA reuse look like, and why does lazy mode enable it?
- How would you instrument buffer ownership?
- What experiment distinguishes cache sync from device overwrite?