
🧩 PCI Express for NIC Engineers

Probes whether a NIC engineer understands PCIe as a real transport with ordering, latency, interrupt, DMA, and platform failure modes rather than just a bus API.

Figure (interactive animation): a posted MMIO write is fire-and-forget; a non-posted read stalls the core.

Posted writes (e.g. a doorbell), fire and forget: the CPU issues each write and immediately moves on; the write drains to the device asynchronously.

Non-posted reads (an MMIO register read), CPU stalls for the completion: each read is a full PCIe round trip; the core cannot retire it until the completion returns, so reads serialize and stall.

In the animation, 5 writes finish roughly 6× sooner than 5 reads.

The takeaway for the datapath: never poll a NIC register on the hot path. You ring a doorbell (a posted write that doesn't stall), and you learn about completions from DMA'd status and events in host memory, never from an MMIO read. One stray read can cost more than the whole rest of the send.

Walk through the Linux probe path for a PCIe NIC from enumeration to mapped BARs and bus mastering. What ordering of calls matters, and what can go wrong if you skip one? (senior)

A good probe path is roughly: match vendor/device or class, pci_enable_device_mem(), set DMA masks with dma_set_mask_and_coherent(), request BAR resources with pci_request_regions() or managed variants, map MMIO with pci_iomap()/pcim_iomap() or ioremap(), allocate queues, allocate MSI-X vectors with pci_alloc_irq_vectors(), request IRQs, then pci_set_master() once the device is ready to DMA.

The important details are not just ceremony:

BAR addresses in config space are not the CPU virtual addresses a driver dereferences; use resource helpers and ioremap/pci_iomap. They are also not what the device uses for DMA into host memory.
pci_set_master() sets the Bus Master Enable bit in the PCI_COMMAND register; without it the function is not allowed to issue Memory Read/Write requests (MSI/MSI-X writes included), so DMA never leaves the device. Memory Space Enable in the same register gates inbound MMIO to BARs.
Requesting regions catches collisions and documents ownership.
The DMA mask must be set before mapping buffers. Leave it at the 32-bit default and mappings above 4 GiB silently bounce through SWIOTLB or fail; set it wider than the hardware can address and the DMA API may hand the device addresses it cannot generate.
Interrupts should usually be registered while the device is quiesced, with device interrupt causes masked, to avoid early or stale interrupts.

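static int nic_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
    int rc;

    /* Managed enable: undone automatically on probe failure or driver detach. */
    rc = pcim_enable_device(pdev);
    if (rc)
        return rc;

    /* Set the DMA mask before any buffer is mapped. */
    rc = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));
    if (rc)
        return rc;

    /* Claim and map BAR 0; the mapping is then available via pcim_iomap_table(). */
    rc = pcim_iomap_regions(pdev, BIT(0), "nic");
    if (rc)
        return rc;

    /* Returns the number of vectors allocated (> 0) or a negative errno. */
    rc = pci_alloc_irq_vectors(pdev, 1, nr_cpu_ids, PCI_IRQ_MSIX);
    if (rc < 0)
        return rc;

    /* Queue setup and request_irq() omitted; enable bus mastering last. */
    pci_set_master(pdev);
    return 0; /* probe must return 0 on success, not the vector count */
}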

A staff-level answer also mentions shutdown symmetry: mask device interrupts, stop DMA, synchronize_irq()/free IRQs, quiesce queues, then unmap/free DMA resources. The reverse-order trap is freeing DMA memory while the device can still write to it, or calling pci_clear_master() before DMA has actually drained.

What they're listening for: Separates PCI config/resource setup from CPU virtual mapping and DMA enablement. The trap is treating BAR values as directly usable pointers or enabling interrupts before the device is quiesced.

Follow-ups

Where would you put `pci_set_master()` relative to queue initialization?
What changes if you use managed `pcim_*` helpers?
How do you debug a device that probes but never DMAs?

Why is an MMIO read from a NIC register much more expensive than an MMIO write, and how should that shape a low-latency datapath? (senior)

A PCIe MMIO write is usually a posted Memory Write TLP: the CPU/root complex can accept it into a posted-write queue and retire the store before the endpoint has acted on it. An MMIO read is a non-posted Memory Read request: the CPU cannot complete the load until a Completion with Data returns from the device. That round trip crosses the core, the uncore, the root complex, the link, a switch if present, the endpoint, and comes back as a completion. The load instruction stalls retirement the entire time, and because it is a strongly ordered uncached access it cannot be hidden by out-of-order execution the way a cache miss to DRAM can.

For a low-latency NIC datapath, this means:

Avoid MMIO reads in the hot path; poll host memory rings or completion queues that the NIC writes by DMA.
Batch doorbell writes, but beware added queueing latency.
Use readbacks deliberately as flushes only on slow paths such as reset, teardown, or error handling.
Prefer device designs where status is DMAed to cacheable host memory and the doorbell path is write-only.

Real systems vary, but MMIO reads are commonly on the order of several hundred nanoseconds directly off a root port, rising into microseconds through a switch or busy fabric, while a posted write can be retired by the CPU in tens of nanoseconds. A strong answer hedges those numbers and says to measure with rdtscp around readl() while pinning the CPU, disabling frequency scaling, and checking PCIe topology with lspci -tv.

What they're listening for: Tests whether the candidate understands posted versus non-posted PCIe transactions and why a single register read can dominate a microsecond budget.

Follow-ups

When is an MMIO read intentionally useful?
How would relaxed ordering change, or not change, this?
How would you measure it without fooling yourself?

Explain posted, non-posted, and completion TLPs using NIC examples. Which path carries a doorbell? Which path carries a register read? (senior)

A NIC driver doorbell write is normally a posted Memory Write TLP to a BAR. The CPU does not wait for an endpoint completion, so it is low overhead, but the write can be buffered on the way (and combined, if the mapping is write-combining) before reaching the NIC.

A register read is a non-posted Memory Read TLP. The endpoint must return a Completion with Data (CplD). Configuration and I/O accesses are also non-posted, which is part of why config space is not a datapath mechanism. A non-posted write (rare on PCIe: I/O writes and config writes) returns a Completion without data (Cpl) carrying only status.

NIC DMA writes to host memory, including RX payload, RX descriptors, or completion entries, are also posted Memory Writes from the endpoint. NIC DMA reads of TX descriptors or packet data are non-posted Memory Reads from the endpoint; each outstanding request holds a tag until host memory returns the data, and the returning completions consume completion credits.

The performance consequence is asymmetric: posted writes stress posted header/data credits and ordering; reads stress request tags (a finite pool: classically 32 with 5-bit tags, 256 with 8-bit Extended Tag, and roughly 1024 with 10-Bit Tag), non-posted credits, completion credits, completion latency, and Max Read Request Size. In a TX-heavy NIC, descriptor and payload fetch policy often matters as much as raw link bandwidth, because read throughput is gated by (tags x payload) / round-trip-latency long before it is gated by the wire.
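The (tags x payload) / round-trip-latency bound can be worked through as a small calculation. This is a minimal sketch: the function names are hypothetical and the inputs are illustrative assumptions, not measurements from any particular NIC.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Upper bound on DMA read throughput: at most `tags` requests of
 * `read_req_bytes` each can be in flight, and each takes one round trip.
 * Returns bytes per second.
 */
double read_bw_bound_bytes_per_s(uint32_t tags, uint32_t read_req_bytes,
                                 double rtt_ns)
{
    assert(rtt_ns > 0.0);
    return (double)tags * (double)read_req_bytes / (rtt_ns * 1e-9);
}

/* Tags needed to keep a link of `link_bytes_per_s` busy at a given RTT. */
uint32_t tags_needed(double link_bytes_per_s, uint32_t read_req_bytes,
                     double rtt_ns)
{
    assert(read_req_bytes > 0);
    /* Bandwidth-delay product: bytes that must be in flight. */
    double in_flight = link_bytes_per_s * rtt_ns * 1e-9;
    uint32_t tags = (uint32_t)(in_flight / read_req_bytes);
    if ((double)tags * read_req_bytes < in_flight)
        tags++; /* round up to a whole request */
    return tags;
}
```

With 32 tags, 512-byte requests, and a 1 us round trip, the bound is about 16.4 GB/s; doubling the round trip halves it, which is why a switch hop or a busy IOMMU shows up as lost TX throughput before the wire saturates.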

What they're listening for: Looks for a concrete mapping from abstract TLP classes to NIC datapath operations, including the device-initiated read side and the tag/credit limits.

Follow-ups

Why can too many outstanding DMA reads hurt tail latency?
What limits read concurrency besides link bandwidth?
Where do completions get reordered or blocked?

State the PCIe transaction ordering rules a NIC relies on. Why is the producer-consumer model safe, and what exactly does the Relaxed Ordering attribute relax? (staff)

PCIe defines ordering within a single Traffic Class per requester using a small set of rules. The ones a NIC depends on:

A Posted write may not pass another Posted write (writes stay in order). This is what makes the producer-consumer model work: a NIC DMAs the packet payload (posted), then DMAs the completion descriptor or flips an ownership bit (posted), and the consumer is guaranteed to see the payload no later than the descriptor.
A Read Request must not pass a Posted write. This forces a read to push prior writes ahead of it, so a flushing readback works.
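Completions must not pass Posted writes unless Relaxed Ordering (or IDO) permits it, which is why the completion of an MMIO read pushes the device's earlier DMA writes ahead of it. Completions may pass each other if they are for different requests; completions for the same request stay in order.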
Non-posted requests may be reordered relative to each other fairly freely, which is why you cannot use two reads to order anything.

Relaxed Ordering (the RO attribute bit in the TLP header) relaxes exactly the first rule for the marked transaction: an RO-marked Posted write is allowed to pass earlier Posted writes, and an RO completion may pass earlier writes. This lets switches and the root complex avoid head-of-line blocking, e.g. let bulk payload writes overtake each other into memory. The danger is marking the publish write (the descriptor or ownership flip) Relaxed: then the consumer can observe the descriptor before the payload it points to, which is silent data corruption. Safe designs mark only the payload data writes RO and keep the ownership/status write strongly ordered, or rely on a device that places the publish write last in strong order regardless. ID-Based Ordering (IDO) is a related, safer relaxation that only reorders traffic from different requester IDs.

What they're listening for: Wants the actual ordering rule (posted cannot pass posted) stated, not just the phrase relaxed ordering. The publish write must stay ordered.

Follow-ups

Which single rule makes a flushing readback work?
Why are two reads useless for ordering?
How is IDO safer than RO for a multi-function NIC?

How do PCIe ordering rules and Relaxed Ordering affect a NIC writing packet data and completion descriptors into host memory? Separate fabric ordering from CPU/cache ordering. (staff)

The producer-consumer invariant is: the CPU must not observe a completion descriptor that says packet N is ready before the packet bytes and metadata it points at are visible. PCIe's posted-cannot-pass-posted rule preserves this for writes from the same requester in the same TC, but the NIC design and descriptor attributes matter.

Relaxed Ordering can improve throughput by letting switches/root complexes reorder traffic that has no dependency. It is dangerous if applied to the publish write. The common rule is: data DMA may be relaxed if the platform and driver tolerate it, but the final ownership/status write that publishes the buffer should be strongly ordered relative to the data it publishes. Some devices use a separate writeback bit, generation counter, or phase bit specifically as the publish point.

The CPU side is a separate ordering domain. Cache coherence does not buy you compiler ordering, or load ordering on weakly ordered CPUs: drivers must still use dma_rmb() (or the DMA API sync calls for streaming mappings) between reading the ownership bit and reading the rest of the descriptor, so the compiler and CPU do not hoist the payload load above the ownership load. On coherent x86 dma_rmb() is just a compiler barrier and is nearly free; on weakly ordered ARM it emits a dmb(oshld). Portable driver code cannot assume the cheap case.

The hard failure mode is intermittent stale data: the ring index or valid bit changes, the driver consumes the packet, and only under load or behind a switch does the payload line lag the completion line, or only on ARM does the CPU reorder the loads.

What they're listening for: Separates PCIe fabric ordering from CPU/cache ordering and recognizes that the publish write is the critical dependency on both sides.

Follow-ups

Would you allow Relaxed Ordering on TX descriptor reads?
What symptom would a missing `dma_rmb()` produce?
How could a NIC firmware change accidentally break this?

What are Max Payload Size and Max Read Request Size, and why might a NIC driver or firmware care about both? (senior)

Max Payload Size (MPS, Device Control bits 7:5) controls the largest data payload a function may put in a single TLP it transmits. It is not negotiated by hardware: firmware or the OS programs it, normally to the minimum supported across the whole path, so a NIC behind a root port that only does 256B cannot use 512B even if the NIC itself supports it. Max Read Request Size (MRRS, Device Control bits 14:12) controls the largest Memory Read request the function may issue; completions are split by the completer at its Read Completion Boundary (RCB, 64 or 128 bytes) and capped by its MPS.

For NICs:

Larger MPS improves efficiency for DMA writes by amortizing the ~20-24 bytes of per-TLP overhead (a 12- or 16-byte header plus framing, sequence number, and LCRC), but increases serialization delay and can worsen head-of-line blocking for latency-sensitive traffic.
Larger MRRS improves DMA read throughput for TX payload fetches (fewer requests, fewer tags consumed per byte), but a big read can monopolize completion bandwidth and increase tail latency for small control reads or other functions sharing the link.
The effective setting is constrained by the whole path: endpoint, root port, switches, and firmware/OS policy. Linux exposes a policy through the pci= boot parameter (e.g. pci=pcie_bus_safe, pci=pcie_bus_perf), which sets the kernel's pcie_bus_config.
Bugs show up as Malformed TLP, Unsupported Request, Completer Abort, poisoned completions, bad performance, or AER logs, not just clean negotiation failure. An MPS mismatch where a device emits a payload larger than a switch was told to expect is a classic source of Malformed TLP storms.

A senior engineer does not hardcode a magic largest value. They inspect lspci -vv (DevCtl: MaxPayload / MaxReadReq), benchmark small-packet latency and bulk throughput separately, and watch AER/correctable error counters.
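The DevCtl fields that lspci prints as MaxPayload and MaxReadReq can be decoded from the raw register as follows. This is a minimal sketch: the helper names are hypothetical, and the bit positions are the ones given above.

```c
#include <assert.h>
#include <stdint.h>

#define DEVCTL_MPS_SHIFT   5    /* Device Control bits 7:5   */
#define DEVCTL_MRRS_SHIFT  12   /* Device Control bits 14:12 */
#define DEVCTL_FIELD_MASK  0x7u

/* Both fields share one encoding: 128 bytes << value, for values 0..5. */
unsigned devctl_field_to_bytes(unsigned field)
{
    assert(field <= 5); /* 6 and 7 are reserved encodings */
    return 128u << field;
}

unsigned devctl_mps_bytes(uint16_t devctl)
{
    return devctl_field_to_bytes((devctl >> DEVCTL_MPS_SHIFT) & DEVCTL_FIELD_MASK);
}

unsigned devctl_mrrs_bytes(uint16_t devctl)
{
    return devctl_field_to_bytes((devctl >> DEVCTL_MRRS_SHIFT) & DEVCTL_FIELD_MASK);
}

/* Minimum completions for one aligned read when the completer sends full-MPS TLPs. */
unsigned min_completions_per_read(unsigned read_bytes, unsigned mps_bytes)
{
    assert(mps_bytes > 0);
    return (read_bytes + mps_bytes - 1) / mps_bytes;
}
```

A DevCtl value of 0x2020 decodes to MPS 256 and MRRS 512, so each maximum-size read comes back as at least two completions; a completer that splits at the RCB may return more.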

What they're listening for: Tests whether the candidate sees packetization tradeoffs, RCB/completion splitting, and fabric constraints, not just bigger equals faster.

Follow-ups

Why can increasing MRRS make p99 latency worse?
What does `lspci -vv` show for MPS/MRRS?
What goes wrong if MPS is mismatched across a switch?

MSI versus MSI-X for a multi-queue NIC: walk through how an MSI-X interrupt is actually generated, and what matters beyond the maximum vector count. (senior)

Both MSI and MSI-X signal an interrupt by having the device issue a posted Memory Write TLP to a special address the OS programmed. The mechanics differ: MSI uses one address with up to 32 vectors encoded by mutating the low bits of a single data value, allocated as a power-of-two block; MSI-X has a per-vector table (in a BAR, located via the Table Offset/BIR in the capability) where each entry is 16 bytes: a 64-bit Message Address, a 32-bit Message Data, and a Vector Control dword whose bit 0 is the per-vector Mask. There is also a Pending Bit Array (PBA): if a vector is masked when the device wants to fire, it sets the PBA bit, and the write is emitted when the vector is unmasked.
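The 16-byte table entry and the Table Offset/BIR encoding described above can be written down as a C layout. This is a sketch for illustration only: the struct and helper names are hypothetical, and a real driver leaves the table to the PCI core.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One MSI-X table entry as the device sees it. */
struct msix_entry_hw {
    uint32_t addr_lo;     /* Message Address, low 32 bits  */
    uint32_t addr_hi;     /* Message Address, high 32 bits */
    uint32_t data;        /* Message Data                  */
    uint32_t vector_ctrl; /* bit 0 = per-vector Mask       */
};
static_assert(sizeof(struct msix_entry_hw) == 16, "MSI-X entry is 16 bytes");

#define MSIX_VECTOR_CTRL_MASKBIT 0x1u

/* Table Offset/BIR dword: BAR index in bits 2:0, 8-byte-aligned offset above. */
unsigned msix_table_bir(uint32_t table_reg)    { return table_reg & 0x7u; }
uint32_t msix_table_offset(uint32_t table_reg) { return table_reg & ~0x7u; }

/* Message Control bits 10:0 hold the table size minus one. */
unsigned msix_table_size(uint16_t msg_ctrl)    { return (msg_ctrl & 0x7FFu) + 1u; }

/* Byte offset of a vector's entry within the BAR named by the BIR. */
size_t msix_entry_offset(uint32_t table_reg, unsigned vector)
{
    return (size_t)msix_table_offset(table_reg) +
           (size_t)vector * sizeof(struct msix_entry_hw);
}

int msix_vector_masked(const struct msix_entry_hw *e)
{
    assert(e != NULL);
    return (e->vector_ctrl & MSIX_VECTOR_CTRL_MASKBIT) != 0;
}
```

Firing vector N amounts to the device issuing a posted write of entry N's data to entry N's address, unless bit 0 of vector_ctrl is set.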

MSI-X is preferred for high-end NICs because each vector's address/data and mask are independent, so vectors can be steered to different CPUs and masked individually. The design questions are:

Do RX/TX queues get one vector per queue pair, or are causes aggregated?
How is vector affinity aligned with RSS indirection, NAPI instances, IRQ affinity (/proc/irq/N/smp_affinity), and NUMA locality? pci_alloc_irq_vectors_affinity() lets the kernel spread vectors automatically.
What interrupt moderation policy balances p50 latency against interrupt rate and p99 burst behavior?
Are interrupts masked at the MSI-X Vector Control bit, the device's own cause/mask register, or both? Masking at the table is expensive (an MMIO write that the kernel typically follows with a flushing read); most NICs gate at a device register.
Does the driver handle lost-edge style races by polling after unmasking, since the message is a one-shot edge with no level to re-assert?

Linux uses pci_alloc_irq_vectors() with PCI_IRQ_MSIX/PCI_IRQ_MSI flags, then pci_irq_vector() and request_irq(). The staff nuance is that interrupts are control-plane hints; the datapath must be correct under polling (NAPI), coalescing, vector migration, and spurious or shared legacy INTx fallback.

What they're listening for: Checks if the candidate knows the MSI-X table/PBA mechanics and connects them to queueing, CPU affinity, NAPI, and moderation failure modes.

Follow-ups

What does the PBA do when a vector is masked?
Why is masking at the MSI-X table more costly than at a device register?
What is the unmask/poll race in NAPI?

Doorbell writes are often mapped write-combining for userspace kernel-bypass NICs. What are the benefits and hazards? (staff)

Write-combining lets the CPU merge adjacent or repeated stores in a WC buffer and drain them as larger PCIe writes. For a userspace TX path this reduces per-packet store overhead and improves doorbell/PIO bandwidth; it is attractive for queue doorbells, PIO send windows, and descriptor push paths where the whole descriptor is written into device memory.

The hazards are all about visibility and ordering:

A WC store can sit in a CPU buffer indefinitely; it drains on a fence, a serializing event, buffer fill, or an access to a different WC line. Latency is not bounded by the store itself.
Stores may be coalesced and reordered within the buffer, so a partially written descriptor could drain, and register layout must tolerate combining.
A later normal (WB) memory store does not order prior WC MMIO stores. On x86 you need sfence (WC ordering is weaker than the usual TSO); plain mov ordering does not apply to WC.
Mapping the wrong BAR region WC breaks registers that require exact width, exact order, or have read side effects.

Kernel drivers should use the correct mapping/accessor pair, for example ioremap_wc()/devm_ioremap_wc() only for regions designed for WC, and writel() (ordered) vs writel_relaxed() (no implicit barrier) deliberately. Kernel-bypass ABIs (verbs, EF_VI, DPDK UIO/VFIO) must document exactly which write publishes the descriptor and what fence is required before ringing the doorbell, since the application, not the kernel, now owns that ordering contract.
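The "fence before the doorbell" contract for a userspace descriptor push might look like the sketch below. Everything here is an assumption for illustration: pio_window and push_descriptor are hypothetical, ordinary memory stands in for a WC mapping, and the non-x86 branch is only a placeholder so the example compiles, not a statement about what other architectures need.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>
#if defined(__x86_64__)
#include <immintrin.h>
#endif

/* Store fence: sfence on x86-64, which orders and drains WC stores. */
static inline void wc_store_fence(void)
{
#if defined(__x86_64__)
    _mm_sfence();
#else
    /* Placeholder only; real WC/device memory needs the platform's own barrier. */
    atomic_thread_fence(memory_order_release);
#endif
}

/* Hypothetical PIO window; in a real driver this would be a WC-mapped BAR region. */
struct pio_window {
    uint8_t desc[64];           /* descriptor slot written with WC stores */
    volatile uint32_t doorbell; /* the write that publishes the descriptor */
};

void push_descriptor(struct pio_window *w, const void *desc, size_t len,
                     uint32_t tail)
{
    assert(len <= sizeof(w->desc));
    memcpy(w->desc, desc, len); /* WC stores: may combine and drain in any order */
    wc_store_fence();           /* every descriptor byte is ordered before the doorbell */
    w->doorbell = tail;         /* publish */
}
```

If the doorbell itself lives in the WC mapping, a second fence after it bounds how long the doorbell store can sit in the buffer.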

What they're listening for: Probes whether the candidate has seen write-combining as both a performance tool and a source of heisenbugs, and knows WC needs an explicit fence.

Follow-ups

When is `writel_relaxed()` wrong?
Why does WC need `sfence` when normal stores do not?
What hardware register design makes WC safe?

What is PCIe peer-to-peer DMA, and why is it hard to make reliable for NIC-to-GPU or NIC-to-NVMe traffic? (staff)

P2P DMA means one PCIe function directly targets another function's BAR or P2P memory instead of routing data through host DRAM. It can cut copies and DRAM bandwidth pressure, which is attractive for storage, GPU, and AI-cluster data paths (GPUDirect / NVMe-to-NIC).

The hard parts are platform routing and correctness:

Devices often need to be under the same switch or root port; many root complexes do not route peer traffic at all, or route it but with poor performance.
ACS (Access Control Services) on the switch/root port can force upstream redirection of P2P requests so they go up to the root complex and back down, defeating the expected path and changing ordering. Direct peer routing usually requires ACS redirect to be off for that path, which weakens isolation.
IOMMUs translate DMA to system memory; targeting a peer BAR requires the IOVA to map to that BAR (or ACS Direct Translated P2P), which not every IOMMU or platform supports.
Lifetime and invalidation are difficult: if the provider driver removes, resets (FLR), or re-BARs the device, consumers must stop DMA first or they write into a dead window.
Security isolation may intentionally block P2P.

In Linux, the PCI P2PDMA layer has pci_p2pdma_add_resource(), pci_p2pmem_find()/pci_alloc_p2pmem(), and pci_p2pdma_distance() to qualify whether a path is supported and how far traffic must travel. A serious design includes topology qualification, fallback to host memory, reset/remove coordination, and counters proving the traffic is not silently hairpinning through the root complex.

What they're listening for: Looks for platform-level skepticism, the ACS-vs-direct-routing tension, and knowledge of Linux P2PDMA constraints rather than a simple zero-copy claim.

Follow-ups

How does ACS force P2P traffic upstream?
What fallback would you ship?
How would you prove the traffic path?

What are ATS and ACS, and how do they matter to a high-performance NIC using an IOMMU? (staff)

ATS (Address Translation Services) lets a PCIe endpoint request translations from the IOMMU and cache them in its own Address Translation Cache (ATC, a device-side TLB). DMA then uses pre-translated addresses (the Translated bit set in the TLP) so the IOMMU does not re-walk on every access, cutting IOTLB pressure. It underpins PASID/PRI shared-virtual-memory designs, where PRI (Page Request Interface) lets the device fault in an unmapped page. The correctness-critical piece is invalidation: when a mapping changes, the IOMMU sends an ATS Invalidation request to the device, and the device must purge matching ATC entries and order all prior uses of those translations ahead of the invalidation completion before acking. A device that returns a stale translation, or acks before draining in-flight DMA, can write to freed or reassigned memory.

ACS (Access Control Services) lives in switches/root ports and controls whether peer traffic is allowed direct, redirected upstream, or blocked. It is essential for isolation and for correct IOMMU grouping: devices that can talk to each other without going through the IOMMU must share an IOMMU group, because the IOMMU cannot otherwise enforce a boundary between them. This is why ACS support directly determines VFIO/SR-IOV assignment granularity.

For a NIC the tradeoff is isolation versus latency/throughput. ATS reduces translation overhead but makes the device partly responsible for translation coherence; a buggy ATC is a security hole (the device asserting Translated bypasses IOMMU permission checks). ACS makes a topology safe for multi-tenant use but can redirect P2P up to the root complex and forces coarser IOMMU groups. A staff answer mentions VFIO, IOMMU groups, invalidation latency under churn, and testing on real topologies rather than assuming a feature bit means the platform path is viable.

What they're listening for: Tests reasoning across endpoint features, the ATC invalidation handshake, IOMMU grouping, virtualization, and switch policy.

Follow-ups

What happens if an ATS translation survives unmap?
Why does lack of ACS coarsen IOMMU groups?
Why is a device asserting the Translated bit a trust decision?

A NIC intermittently disappears or throws completion timeouts under load. Walk through completion timeout, the replay timer, and link recovery, and how you'd tell a fabric problem from a device hang. (staff)

There are two distinct timeout mechanisms, and conflating them sends you down the wrong path.

Completion Timeout is a transaction-layer timer at the requester: after issuing a non-posted request (e.g. a NIC's Memory Read for TX data), if no Completion returns within the programmed window the requester reports a Completion Timeout (an uncorrectable error, surfaced via AER). The range is set in Device Control 2 (Completion Timeout Value field): spec ranges run from Range A (50 us to 10 ms) up to Range D (4 s to 64 s), with the default in the 50 us-50 ms band. A NIC seeing completion timeouts means its own read to host memory (or to a peer) never came back: causes include an IOMMU fault dropping the request, a switch/path problem, the completer (host or peer) stalling, or too many outstanding reads exhausting credits so requests never get serviced.
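The Completion Timeout Value field can be decoded from a raw Device Control 2 value (as read with lspci -xxxx or setpci) along these lines. This is a sketch: the struct and function names are hypothetical, and the encodings should be checked against the spec revision in use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct cto_range {
    const char *range; /* "default", "A", "B", "C" or "D" */
    uint64_t min_us;
    uint64_t max_us;
};

/* Device Control 2 bit 4: Completion Timeout Disable. */
int cto_disabled(uint16_t devctl2)
{
    return (devctl2 >> 4) & 1;
}

/* Decode bits 3:0. Returns 0 and fills *out, or -1 for a reserved encoding. */
int decode_completion_timeout(uint16_t devctl2, struct cto_range *out)
{
    assert(out != NULL);
    switch (devctl2 & 0xFu) {
    case 0x0: *out = (struct cto_range){ "default", 50, 50000 };        return 0;
    case 0x1: *out = (struct cto_range){ "A", 50, 100 };                return 0;
    case 0x2: *out = (struct cto_range){ "A", 1000, 10000 };            return 0;
    case 0x5: *out = (struct cto_range){ "B", 16000, 55000 };           return 0;
    case 0x6: *out = (struct cto_range){ "B", 65000, 210000 };          return 0;
    case 0x9: *out = (struct cto_range){ "C", 260000, 900000 };         return 0;
    case 0xA: *out = (struct cto_range){ "C", 1000000, 3500000 };       return 0;
    case 0xD: *out = (struct cto_range){ "D", 4000000, 13000000 };      return 0;
    case 0xE: *out = (struct cto_range){ "D", 17000000, 64000000 };     return 0;
    default:  return -1; /* reserved */
    }
}
```

Knowing the programmed window matters for triage: a timeout logged with a 65-210 ms window points at something far more stuck than one logged with a 50-100 us window.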

Replay timer / ACK-NAK is a data-link-layer mechanism, one layer below. Every TLP is held in the replay buffer until the receiver ACKs it; if the ACK does not arrive in time, or a NAK arrives, the sender replays from the buffer. Excessive replays indicate signal-integrity problems and show up as correctable errors (Bad TLP / Bad DLLP). Repeated failure drives the LTSSM into Recovery to retrain; if retraining fails the link can drop width/speed or go down entirely, which is when the device vanishes from lspci.

To separate them: check lspci -vv AER status for whether errors are correctable link errors (DLLP/TLP, replay, signal integrity) versus uncorrectable transaction errors (Completion Timeout, UR, CA). Look at LnkSta for unexpected speed/width downgrades (a fabric/SI symptom). Correlate timestamps with load, ASPM state, and temperature. A device hang typically shows clean link but timeouts and a stuck queue; a fabric/SI problem shows correctable-error escalation, replays, LTSSM Recovery, and link retraining or surprise removal. Experiments: pin link speed/width lower, disable ASPM, reduce MRRS/outstanding reads, move slots or reseat retimers/cables, and check IOMMU fault logs.

What they're listening for: Wants the two timers cleanly separated: completion timeout is transaction-layer at the requester; replay/ACK-NAK is link-layer and feeds LTSSM Recovery. Different root causes.

Follow-ups

Which register sets the completion timeout range?
What does an unexpected LnkSta width/speed downgrade suggest?
Why can too many outstanding reads cause completion timeouts?

What is ASPM, and how does it bite a latency-sensitive NIC? How would you confirm it's the culprit? (senior)

ASPM (Active State Power Management) lets the link autonomously enter low-power states during idle: L0s (fast, one-directional electrical idle, exit ~hundreds of ns) and L1 (deeper, both directions, exit often single-digit microseconds), with L1 substates L1.1/L1.2 saving more by gating common-mode/reference clock at the cost of even longer exit. The exit latency is paid on the first transaction after idle.

The NIC impact is tail latency and jitter, not average throughput. A bursty RX pattern lets the link drift to L1 between bursts; the next packet's DMA write, or the next doorbell write and the descriptor read it triggers, then eats the L1 exit, adding microseconds to exactly the latency-sensitive first packet. Each device advertises its acceptable exit latency, and the OS is supposed to only enable a state if the path's exit latency is within the endpoint's tolerance, but firmware/BIOS misconfiguration and aggressive platform defaults frequently violate this in practice. This is a classic cause of mysterious p99 spikes that vanish under sustained load (because the link never idles).

To confirm: lspci -vv shows LnkCap (ASPM supported, exit-latency fields) and LnkCtl (ASPM control: enabled state). Toggle it per device via sysfs (/sys/bus/pci/devices/<bdf>/link/l1_aspm), platform-wide via the policy at /sys/module/pcie_aspm/parameters/policy, or with the pcie_aspm=off kernel parameter, then re-measure the latency histogram. Note that pcie_aspm=off tells the kernel to leave ASPM configuration alone, so whatever firmware enabled stays enabled; re-check LnkCtl after any change. If p99/p99.9 collapses with ASPM off but throughput is unchanged, ASPM exit was the cause. The right fix is usually to disable L1(/substates) for that device rather than blanket-disabling ASPM across the platform, since other devices may need the power savings.
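The LnkCap and LnkCtl fields that lspci summarizes can also be decoded from raw register values, which helps when scripting a fleet-wide check. This is a minimal sketch with hypothetical names; exit latencies are reported as the upper edge of the advertised bucket.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct aspm_info {
    unsigned supported;       /* LnkCap bits 11:10: bit 0 = L0s, bit 1 = L1 */
    unsigned enabled;         /* LnkCtl bits 1:0, same encoding             */
    uint32_t l0s_exit_max_ns; /* 0 means "above the largest bucket"         */
    uint32_t l1_exit_max_ns;  /* 0 means "above the largest bucket"         */
};

void decode_aspm(uint32_t lnkcap, uint16_t lnkctl, struct aspm_info *out)
{
    /* Upper bounds for LnkCap bits 14:12 (L0s) and 17:15 (L1) exit latency. */
    static const uint32_t l0s_max_ns[8] = { 64, 128, 256, 512, 1000, 2000, 4000, 0 };
    static const uint32_t l1_max_ns[8]  = { 1000, 2000, 4000, 8000, 16000, 32000, 64000, 0 };

    assert(out != NULL);
    out->supported = (lnkcap >> 10) & 0x3u;
    out->enabled = lnkctl & 0x3u;
    out->l0s_exit_max_ns = l0s_max_ns[(lnkcap >> 12) & 0x7u];
    out->l1_exit_max_ns = l1_max_ns[(lnkcap >> 15) & 0x7u];
}

int aspm_l1_enabled(const struct aspm_info *a)
{
    return (a->enabled & 0x2u) != 0;
}
```

A device that advertises an L1 exit latency in the tens of microseconds and has L1 enabled in LnkCtl is the first candidate when a p99 spike appears only at low load.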

What they're listening for: Connects a power-management feature to tail-latency jitter and knows it shows up at low load, not high load. Confirmation via lspci LnkCtl plus a measured histogram.

Follow-ups

Why does the spike disappear under sustained load?
Which `lspci` fields show ASPM state and exit latency?
Why prefer disabling L1 for one device over `pcie_aspm=off` globally?

Your driver needs to reset a wedged NIC function without rebooting. What does an FLR actually do and not do, and what must the driver handle around it? (staff)

A Function Level Reset (FLR), triggered via the Initiate Function Level Reset bit in the Device Control register (and exposed in Linux through pci_reset_function() / pcie_flr()), resets a single function: it stops DMA, clears the function's internal state and queues, and returns most registers to defaults. Crucially for correctness, the spec requires the function to terminate or complete any outstanding requests so that no Completions arrive for transactions issued before the reset, and to not retain DMA state. The function must signal it can accept FLR and complete it within a bounded time (the spec allows up to 100 ms before config access should succeed again).

What FLR does NOT do is the trap: it does not preserve your configuration. BAR assignments, the Command register (Bus Master/Memory Space), MRRS, and MSI-X enable plus the entire MSI-X table are reset. The spec exempts only a short list of fields (Max_Payload_Size, ASPM Control, and RCB among them) and sticky registers such as the AER logs. The standard PCI header plus MSI/MSI-X and PM capabilities must be saved and restored by the OS (pci_save_state() before, pci_restore_state() after); the extended config space and any SR-IOV header are generally NOT restored by FLR handling, so the driver must reprogram device-specific state itself.

What the driver must handle: quiesce first (mask interrupts, stop queues, synchronize_irq()), tear down DMA mappings or ensure the device cannot reference them, save state, issue the reset, restore state, re-enable bus mastering, rebuild rings and re-map DMA, re-init MSI-X (re-request affinity), and re-arm. For SR-IOV, a PF reset implicitly disturbs its VFs; an FLR on a VF assigned to a guest via VFIO must be coordinated so the guest driver re-initializes. Skipping the quiesce means in-flight DMA or a late completion can corrupt memory or land on a half-reset function; skipping save/restore means the function comes back with no BARs and Bus Master Enable cleared, looking 'dead' even though the silicon is fine.

What they're listening for: Knows FLR resets config state (BARs, Command, MSI-X) so the OS must save/restore, must quiesce and tear down DMA first, and that extended/SR-IOV space is not restored. The 'comes back with no BARs' trap.

Follow-ups

Why must you `pci_save_state()` before FLR?
What in-flight hazard does FLR's completion-handling requirement prevent?
How does a VF FLR interact with a VFIO guest?

AER reports correctable Bad TLP/Bad DLLP and occasional completion timeouts under load on a NIC. How do you debug without immediately blaming the card? (senior)

AER is PCIe Advanced Error Reporting: it records correctable and uncorrectable link/transaction errors in config space and reports them through the OS. The first step is to capture facts: BDF, link speed/width (LnkSta), MPS/MRRS, AER status/header-log registers, kernel log timing, topology, firmware versions, ASPM state, retimers/cables if external, and whether errors correlate with load, power state, or temperature.

Then separate classes of failure:

Correctable Bad DLLP/Bad TLP point to the data link layer: signal integrity, a marginal retimer, a dirty connector, or marginal link training. These trigger replays; a rising replay/error rate that escalates to LTSSM Recovery is the signature of an SI problem.
Completion Timeout (uncorrectable) points to a transaction that never came back: endpoint firmware stall, root-port policy, too many outstanding reads, a device reset mid-flight, an IOMMU/ACS fault dropping the request, or a fabric path problem.
Unsupported Request or Completer Abort often implicates a bad BAR access, a config access to a disabled function, an MPS/MRRS violation (Malformed TLP), or stale DMA to a removed/FLRed function.
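The split between link-layer and transaction-layer errors can be made mechanical by bucketing the AER status registers. This is a sketch: the bit values follow the PCI_ERR_COR_* and PCI_ERR_UNC_* definitions in Linux's pci_regs.h, while the enum and aer_triage() are hypothetical names for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Correctable Error Status bits. */
#define AER_COR_RCVR       0x00000001u /* Receiver Error       */
#define AER_COR_BAD_TLP    0x00000040u
#define AER_COR_BAD_DLLP   0x00000080u
#define AER_COR_REP_ROLL   0x00000100u /* REPLAY_NUM Rollover  */
#define AER_COR_REP_TIMER  0x00001000u /* Replay Timer Timeout */

/* Uncorrectable Error Status bits. */
#define AER_UNC_DLP        0x00000010u /* Data Link Protocol Error */
#define AER_UNC_SURPDN     0x00000020u /* Surprise Down            */
#define AER_UNC_COMP_TIME  0x00004000u /* Completion Timeout       */
#define AER_UNC_COMP_ABORT 0x00008000u /* Completer Abort          */
#define AER_UNC_MALF_TLP   0x00040000u /* Malformed TLP            */
#define AER_UNC_UNSUP      0x00100000u /* Unsupported Request      */

static_assert((AER_COR_BAD_TLP & AER_COR_BAD_DLLP) == 0, "status bits are distinct");

enum aer_suspect { AER_NONE, AER_LINK_SI, AER_TRANSACTION, AER_BOTH };

/* First-pass triage only: it says where to look, not what is broken. */
enum aer_suspect aer_triage(uint32_t cor_status, uint32_t unc_status)
{
    const uint32_t link_cor = AER_COR_RCVR | AER_COR_BAD_TLP | AER_COR_BAD_DLLP |
                              AER_COR_REP_ROLL | AER_COR_REP_TIMER;
    const uint32_t link_unc = AER_UNC_DLP | AER_UNC_SURPDN;
    const uint32_t txn_unc  = AER_UNC_COMP_TIME | AER_UNC_COMP_ABORT |
                              AER_UNC_MALF_TLP | AER_UNC_UNSUP;

    int link = (cor_status & link_cor) != 0 || (unc_status & link_unc) != 0;
    int txn  = (unc_status & txn_unc) != 0;

    if (link && txn)
        return AER_BOTH;
    if (link)
        return AER_LINK_SI;
    if (txn)
        return AER_TRANSACTION;
    return AER_NONE;
}
```

The scenario in the question (Bad TLP/Bad DLLP plus occasional Completion Timeout) lands in AER_BOTH, which argues for testing signal-integrity margin first, since a marginal link can plausibly produce both symptoms.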

Practical steps include lspci -vvxxxx, decoding the AER header log to recover the offending TLP, setpci only carefully, forcing lower link speed/width to test SI margin, disabling ASPM for an experiment, changing slot/topology, reducing MRRS/outstanding reads, checking IOMMU fault logs, and correlating with the NIC's internal error counters. The goal is to turn a vague PCIe error into a reproducible transaction or link condition.

What they're listening for: Rewards systematic hardware/software fault isolation, knowledge of AER categories, and using the header log to identify the offending TLP.

Follow-ups

What would make you suspect signal integrity?
What does the AER header log give you?
How can reducing MRRS help diagnosis?
