
🧠 AI/HPC Cluster Networking

Probes whether a candidate can reason from verbs, PCIe, congestion, and collectives up to whole-cluster AI training behavior.

[Interactive figure: ring vs tree all-reduce step count as the GPU count varies. Example state for N = 8 GPUs (G0 to G7): ring 2(N−1) = 14 steps, tree 2⌈log₂N⌉ = 6 steps.]
Why the algorithm isn't “a detail.” Ring all-reduce is bandwidth-optimal (each GPU sends ~2× its data, total independent of N) but takes 2(N−1) latency hops; tree / halving-doubling takes only 2⌈log₂N⌉. At N = 1024 that's 2046 hops vs 20. On a GPU cluster the collective is on the critical path — one slow link or a bad algorithm choice stalls every GPU, which is exactly why AI fabrics obsess over tail latency and in-order delivery.
A training job over RoCEv2 shows good median bandwidth but periodic multi-second stalls in all-reduce. Walk through how you would distinguish endpoint CQ/QP problems from fabric congestion and PFC pathologies. [staff]

I would start by separating transport progress from fabric health.

On endpoints I would sample ibv_poll_cq() progress, CQ overrun/async events, retry/RNR counters if exposed, NIC tx/rx pause counters, ECN/CNP counters, and per-QP error transitions. A CQ that is not drained can overrun, which surfaces as a CQ error async event and typically pushes the attached QPs into the error state; a missing receive on RC produces RNR NAKs and retries; a bad MR/lkey/rkey usually produces a work completion error rather than a fabric-wide stall.

On the fabric I would inspect per-priority PFC pause frames, pause duration, ECN mark rate, queue occupancy histograms if available, packet drops, and whether congestion is localized to a few uplinks. RoCEv2 normally relies on UDP/IP routing plus a lossless or near-lossless class. If PFC is triggered frequently, it can propagate backpressure and create head-of-line blocking outside the original hotspot.

For collectives, I would correlate the stall with rank-level timelines. One slow rank can stall a ring or tree phase because other ranks wait at the next dependency. If every rank stops at the same collective step, suspect a shared fabric event. If one rank stops first and others cascade, suspect that host, GPU, NIC, PCIe path, or its attached leaf.
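The "same step everywhere" versus "one rank first, others cascade" test can be sketched as a small classifier over per-rank stall timestamps. This is a minimal illustration, assuming you already have one stall-start timestamp per rank from your own tracing; `struct stall_summary`, `summarize_stalls`, and the window parameter are hypothetical names, not part of NCCL/RCCL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical summary of when each rank stopped making progress. */
struct stall_summary {
    size_t first_rank;        /* rank whose stall began earliest */
    size_t near_simultaneous; /* ranks stalled within window_ns of it, itself included */
};

/* stall_start_ns[r] is the time rank r stopped progressing, taken from
 * your own rank-level trace (an assumption of this sketch). */
static struct stall_summary summarize_stalls(const uint64_t *stall_start_ns,
                                             size_t n, uint64_t window_ns)
{
    assert(stall_start_ns != NULL && n > 0);
    struct stall_summary s = { .first_rank = 0, .near_simultaneous = 0 };

    /* Find the rank that stalled first. */
    for (size_t r = 1; r < n; r++)
        if (stall_start_ns[r] < stall_start_ns[s.first_rank])
            s.first_rank = r;

    /* Count how many ranks stalled at almost the same time. */
    uint64_t t0 = stall_start_ns[s.first_rank];
    for (size_t r = 0; r < n; r++)
        if (stall_start_ns[r] - t0 <= window_ns)
            s.near_simultaneous++;
    return s;
}
```

If `near_simultaneous` is close to the rank count, the pattern points at a shared fabric event; if it is 1, `first_rank` names the host, NIC, or leaf to inspect first. The window has to be chosen relative to clock skew between hosts, which this sketch ignores.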

I would run a controlled test: pairwise RDMA write/read, then many-to-one incast, then the real NCCL/RCCL pattern. Use packet capture or switch telemetry for ECN/PFC and endpoint tracing for CQ completion latency. The multi-second magnitude is itself a clue: that is far longer than any RTT or ECN loop, so it usually means a PFC pause storm, a retransmit timeout, or a stalled host thread, not ordinary queueing. The key is not just average throughput; the diagnosis lives in the tail of completion latency and pause propagation.

An RDMA queue pair: post WR, NIC executes, completion lands.
What they're listening for: A strong answer connects CQ/QP semantics, PFC/ECN/DCQCN behavior, and collective synchronization rather than treating the problem as generic bandwidth loss, and reads the multi-second timescale as diagnostic.
Follow-ups
  • What counter pattern would make you suspect PFC storm propagation?
  • How can CQ moderation hide the first symptoms?
  • Why can one bad rank stall an otherwise healthy ring?
Compare RoCEv2 and InfiniBand for a large GPU training fabric. Do not stop at marketing differences; discuss control plane, loss model, congestion control, and operational failure modes. [senior]

InfiniBand is a purpose-built RDMA fabric with its own link layer, a centralized subnet manager that assigns LIDs and programs the forwarding tables, service levels for traffic separation, and credit-based link-level flow control that makes the fabric lossless by construction. RoCEv2 carries RDMA over UDP/IP (the BTH rides in a UDP datagram, destination port 4791), so it can use Ethernet routing and operational tooling, but it inherits Ethernet queueing, ECMP, QoS, and loss-management problems.

The practical distinction is not simply latency. It is how hard it is to keep the transport assumptions true. InfiniBand's credit flow control means a sender never transmits unless the next hop has buffer, so loss is essentially designed out. RoCEv2 has no such guarantee, so it usually depends on a carefully engineered Ethernet class using PFC, ECN, and NIC congestion control such as DCQCN. Misconfigured DSCP/PCP mappings, asymmetric ECMP, pause thresholds, or buffer carve-outs can turn an otherwise fast Ethernet network into a tail-latency amplifier. Classic go-back-N RoCE NICs also retransmit from the lost packet onward, so a single drop can be very expensive; newer NICs add selective retransmit.

RoCEv2 has operational advantages: IP routability, reuse of Ethernet switch ecosystems, larger vendor pool, and easier integration into existing data centers. InfiniBand often has stronger vertically integrated semantics and SHARP-style in-network reduction for HPC. For AI clusters, the answer is workload and operations dependent: large synchronized collectives punish rare tail events, so the fabric with better tail behavior under failures is often better than the one with the best clean-room microbenchmark.

What they're listening for: The nuance is that IB is lossless by credit construction while RoCEv2 must be engineered lossless via PFC/ECN/DCQCN as one system; bonus for go-back-N vs selective retransmit.
Follow-ups
  • Where does UDP source port selection matter in RoCEv2?
  • What makes ECMP hashing dangerous for collectives?
  • What does PFC solve, and what new problem does it create?
Explain the lifecycle of an RDMA buffer used for GPU communication, including memory registration, keys, and what GPUDirect RDMA changes. [senior]

For ordinary host memory, the application allocates a buffer, registers it with ibv_reg_mr(), and receives an ibv_mr containing lkey and rkey. The NIC needs the registration so it can DMA to pinned pages and perform address translation and protection checks: registration pins the pages (so they cannot be swapped or moved) and populates the NIC translation tables so a virtual address maps to the right physical/IOVA pages. Local send/receive SGEs use lkey; remote RDMA read/write operations require the peer to know the target virtual address and rkey.

GPUDirect RDMA changes the physical target of DMA. Instead of staging through host memory, a capable NIC can directly read/write GPU memory over PCIe (across the GPU's BAR aperture), exposed to the RDMA stack either by the legacy nvidia-peermem peer-memory client or, increasingly, by the kernel dma-buf mechanism, which is now the recommended path and is also how AMD's amdgpu/ROCm exports device memory. That removes copies and CPU involvement, but it adds constraints: topology matters (NIC and GPU should sit under the same PCIe switch, ideally not crossing the CPU root complex or an inter-socket link), PCIe ACS redirection can force peer-to-peer traffic up to the root complex and kill it or slow it down, IOMMU settings can break the path, and BAR1 size can limit how much GPU memory is exposable at once.

The hard bug is stale registration. If a GPU buffer is freed or reused while the NIC still has a valid mapping or outstanding work request, the result can be silent corruption or a completion error much later. Production stacks use registration caches, invalidation callbacks (dma-buf move-notify / peer-memory invalidate), and strict synchronization between CUDA/HIP streams and RDMA completion.

What they're listening for: A senior candidate should know that registration is protection plus pinning/translation, that GPUDirect targets the BAR via peer-memory or dma-buf, and that ACS/PCIe topology is a correctness and performance boundary.
Follow-ups
  • When would you avoid registering per message?
  • What can go wrong with a registration cache?
  • Why does PCIe ACS break GPUDirect peer-to-peer?
Show a minimal verbs send path and identify the ordering and lifetime bugs that are easy to miss. [senior]

A minimal reliable-connected send path looks roughly like this:

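/* Needs <stdint.h> and <infiniband/verbs.h>; qp, cq, mr, buf, len, and id
   come from setup code, and this fragment sits inside a function. */
struct ibv_sge sge = {
  .addr = (uintptr_t)buf,            /* must lie inside the region registered as mr */
  .length = len,
  .lkey = mr->lkey,
};
struct ibv_send_wr wr = {
  .wr_id = id,                       /* echoed back in the work completion */
  .sg_list = &sge,
  .num_sge = 1,
  .opcode = IBV_WR_SEND,
  .send_flags = IBV_SEND_SIGNALED,   /* request a CQE for this WR */
};
struct ibv_send_wr *bad;
if (ibv_post_send(qp, &wr, &bad))    /* nonzero: WR rejected, bad points at it */
  return -1;

/* Busy-poll for one completion. ibv_poll_cq() returns the number of
   completions, or a negative value on failure, so check both. */
struct ibv_wc wc;
int n;
do {
  n = ibv_poll_cq(cq, 1, &wc);
} while (n == 0);
if (n < 0)
  return -1;                         /* wc is not valid here */
if (wc.status != IBV_WC_SUCCESS)
  handle_error(wc.status);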

The buffer and mr must remain valid until the work completion says the NIC is done with the SGE. ibv_post_send() only means the WR was accepted by the provider, not that bytes have left the NIC. If you unsignal too aggressively, you still need periodic signaled WRs or another mechanism to prevent send queue exhaustion and to learn about errors.
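One common way to handle the unsignaled case is to signal every Nth WR and treat a successful signaled completion as retiring the WRs posted before it on that QP, since a send queue processes its WRs in order. The sketch below models only that bookkeeping. `struct sq_tracker`, `sq_reserve`, and `sq_on_signaled_completion` are hypothetical helpers, not libibverbs API, and the sketch assumes one tracker per QP and no error completions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-QP bookkeeping; not a libibverbs type. */
struct sq_tracker {
    uint32_t depth;        /* send queue depth the QP was created with */
    uint32_t signal_every; /* request a CQE on every Nth WR */
    uint32_t outstanding;  /* WRs posted and not yet known to be complete */
    uint32_t since_signal; /* unsignaled WRs since the last signaled one */
};

/* Reserve a slot for one WR. Returns false if the send queue is full.
 * On success, *signaled says whether this WR should carry IBV_SEND_SIGNALED. */
static bool sq_reserve(struct sq_tracker *t, bool *signaled)
{
    assert(t->signal_every >= 1 && t->signal_every <= t->depth);
    if (t->outstanding == t->depth)
        return false; /* caller must poll the CQ before posting more */
    t->outstanding++;
    t->since_signal++;
    *signaled = (t->since_signal == t->signal_every);
    if (*signaled)
        t->since_signal = 0;
    return true;
}

/* Call once per successful signaled completion: that WR and the
 * unsignaled WRs posted before it are done, so their slots are free. */
static void sq_on_signaled_completion(struct sq_tracker *t)
{
    assert(t->outstanding >= t->signal_every);
    t->outstanding -= t->signal_every;
}
```

Keeping `signal_every` no larger than the queue depth guarantees that a full queue always contains at least one signaled WR, so polling the CQ will eventually free slots.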

For receive-based operations, the peer must have receives posted before sends arrive or RC can hit receiver-not-ready behavior. For RDMA write, there is no remote receive completion by default, so application-level visibility usually needs an immediate (IBV_WR_RDMA_WRITE_WITH_IMM, which consumes a remote receive and generates a receive CQE), a separate send, or a higher-level protocol; on RC a local write completion means the responder NIC acknowledged the request, not that the remote CPU has observed the bytes. Memory ordering with the CPU/GPU is also separate from transport completion; do not assume a CUDA/HIP kernel sees data without the right stream/event synchronization.

What they're listening for: The trap is confusing post, transmit, remote visibility, and local buffer lifetime; staff-level debugging often starts with those distinctions.
Follow-ups
  • Why not signal every WR on a hot path?
  • What does a successful RDMA write completion prove?
  • How do you recover from a QP entering error?
For all-reduce, compare ring and tree algorithms in the presence of 400G/800G NICs, GPU memory bandwidth, and tail latency. [senior]

Ring all-reduce is reduce-scatter plus all-gather. Across n ranks it takes 2(n-1) communication steps but each step moves only 1/n of the buffer, so total bytes per rank is fixed at roughly 2(n-1)/n of the message regardless of scale; that is why the ring is bandwidth-optimal for large messages. The cost is that the step count grows linearly with n, so latency dominates for small messages and a single slow link or rank holds the whole pipeline.

Tree algorithms (NCCL uses a double binary tree) reduce the dependency depth to about 2·log2(n) steps, so latency grows logarithmically. That wins for small and medium messages and for very large rank counts where the ring's 2(n-1) term hurts. Trees can underutilize bisection bandwidth for huge tensors, and internal tree nodes carry more traffic.
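The step counts and the ring's per-rank byte volume quoted above reduce to a few lines of arithmetic. The helper names below are illustrative only; the formulas are the ones in the text, 2(n-1) ring steps, 2⌈log2 n⌉ tree steps, and 2(n-1)/n of the message per rank for the ring.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest k such that 2^k >= n. */
static uint32_t ceil_log2(uint32_t n)
{
    assert(n >= 1);
    uint32_t bits = 0;
    while ((UINT64_C(1) << bits) < n)
        bits++;
    return bits;
}

/* Ring all-reduce: (n-1) reduce-scatter steps plus (n-1) all-gather steps. */
static uint32_t ring_steps(uint32_t n)
{
    assert(n >= 1);
    return 2 * (n - 1);
}

/* Tree all-reduce: reduce up the tree, then broadcast back down. */
static uint32_t tree_steps(uint32_t n)
{
    return 2 * ceil_log2(n);
}

/* Bytes each rank sends in a ring all-reduce of msg_bytes:
 * 2(n-1) steps, each moving msg_bytes/n. Integer division rounds down,
 * and very large messages could overflow; fine for a sketch. */
static uint64_t ring_bytes_per_rank(uint64_t msg_bytes, uint32_t n)
{
    assert(n >= 1);
    return 2 * (uint64_t)(n - 1) * msg_bytes / n;
}
```

For n = 8 this gives 14 ring steps against 6 tree steps, and for n = 1024 it gives 2046 against 20, while the ring's bytes per rank stay just under twice the message size at any scale.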

Orthogonal to the algorithm is the protocol: NCCL/RCCL pick Simple (large messages, full bandwidth), LL (8-byte flag-synced chunks, lowest latency, wastes ~half the bytes), or LL128 (128-byte units, near-LL latency with much better bandwidth on NVLink). On NVLink-connected pods, NVLS/SHARP-style in-switch reduction can beat both. The library auto-tunes algorithm and protocol by message size, rank count, and topology, which is why a single "which is faster" answer is wrong.

At 400G/800G the bottleneck often moves off the wire: PCIe or scale-up link, GPU HBM read-modify-write bandwidth for the reduction, NIC packet rate, registration/cache behavior, or switch-buffer tail. A good engineer also asks which algorithm fails gracefully when one rank has a 99.99th-percentile delay.

What they're listening for: This probes whether the candidate can reason about collectives as dependency graphs (ring 2(n-1) bandwidth-optimal vs tree 2 log n latency) and knows protocol (LL/LL128/Simple) is a separate axis.
Follow-ups
  • Why can small all-reduces prefer trees?
  • What does LL128 trade off versus Simple?
  • How would you map ranks across dual-rail NICs?
Describe the incast problem in AI collectives and parameter exchange. Why is it worse than a simple many-to-one throughput test suggests? [senior]

Incast happens when many senders converge on the same receiver, switch port, PCIe path, or GPU memory target in a short time window. In AI workloads this can occur at collective phase boundaries, during gradient aggregation, checkpoint traffic, or synchronized parameter exchange.

The simple test says the receiver has N senders and a fixed egress bandwidth. The real problem is burst synchronization and shallow timing slack. If many ranks finish compute at nearly the same time, packets arrive faster than the receiver port can drain and faster than any end-to-end control loop (ECN/DCQCN runs over an RTT or more) can react. Switch queues absorb the microburst until they overflow; then ECN marks, PFC pause, drops, or go-back-N retransmission introduce tail latency. The collective magnifies that tail because all ranks wait for the delayed dependency, so one congested receiver throttles the whole job step.
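The microburst behavior can be reproduced with a toy tick-based queue model. This is a deliberately simplified sketch: every sender emits one packet per tick, the egress port drains a fixed number per tick, excess arrivals are tail-dropped, and there is no retransmission, ECN, or PFC. All names and parameters are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

struct incast_result {
    uint32_t delivered;  /* packets that left the egress port */
    uint32_t dropped;    /* packets tail-dropped at a full buffer */
    uint32_t peak_queue; /* deepest queue seen, in packets */
};

/* Toy model: `senders` hosts each send `pkts_per_sender` packets, one per
 * tick, into a port that drains `drain_per_tick` and buffers `buffer_pkts`. */
static struct incast_result simulate_incast(uint32_t senders,
                                            uint32_t pkts_per_sender,
                                            uint32_t drain_per_tick,
                                            uint32_t buffer_pkts)
{
    assert(drain_per_tick > 0);
    struct incast_result r = { 0, 0, 0 };
    uint32_t queue = 0;
    uint32_t remaining = pkts_per_sender; /* ticks of burst left */

    while (remaining > 0 || queue > 0) {
        if (remaining > 0) {
            /* Synchronized arrival: one packet from every sender. */
            uint32_t room = buffer_pkts - queue;
            uint32_t accepted = senders < room ? senders : room;
            queue += accepted;
            r.dropped += senders - accepted;
            remaining--;
        }
        if (queue > r.peak_queue)
            r.peak_queue = queue;

        /* Egress drains at its fixed rate. */
        uint32_t sent = queue < drain_per_tick ? queue : drain_per_tick;
        queue -= sent;
        r.delivered += sent;
    }
    return r;
}
```

With 8 senders, 6 packets each, a drain rate of 4, and a 12-packet buffer, the buffer fills on the second tick and 16 of the 48 packets are dropped; with 4 senders the queue never exceeds 4 and nothing is lost. In a real fabric each of those drops would trigger PFC, ECN, or retransmission rather than simply disappearing.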

Mitigations include topology-aware scheduling, staggering or chunking, adaptive routing or packet spraying where the transport supports it, ECN/DCQCN tuning, and avoiding single aggregation points. Receiver-driven schemes help directly: Ultra Ethernet's Receiver Credit Congestion Control (RCCC) issues credits so senders cannot all blast the victim at once, which targets incast better than a purely sender-reactive loop. The dangerous mitigation is adding huge buffers: it can reduce drops while increasing queueing delay enough to stall the job anyway (bufferbloat).

[Interactive figure: senders S1 to S8 bursting into one switch egress port, with counters for buffer occupancy, drops, delivered packets, and goodput. Example state: N×1 = 8 pkts/tick offered vs 4 drained, buffer absorbing.]
TCP incast. N senders synchronise a burst at one egress port. While N ≤ 4 the port drains as fast as packets arrive and the buffer stays shallow. Push N past what the buffer can absorb in one RTT and the tail overflows: packets are dropped, each dropped flow stalls a full RTO before retrying, and aggregate goodput collapses even though the link is barely used. Drag N up to find the cliff; a bigger buffer pushes the cliff right but adds queueing latency.
What they're listening for: The candidate should identify synchronization, the control-loop-vs-burst timescale mismatch, and tail amplification, and ideally cite receiver-driven credit (RCCC) as the structural fix rather than just more bandwidth.
Follow-ups
  • How would you reproduce incast in a lab?
  • Why can bigger buffers hurt collectives?
  • Why is a receiver-credit scheme well matched to incast?
Explain DCQCN at a level useful for debugging a RoCEv2 AI fabric. What are the moving parts and what misconfiguration would you look for? [staff]

DCQCN is an end-to-end, rate-based congestion control for RoCEv2 with three roles. The CP (congestion point, the switch) marks packets with ECN CE when a queue crosses a (typically RED-style min/max) threshold. The NP (notification point, the receiver) sees CE-marked packets and returns CNP frames to the sender, usually rate-limited by a CNP timer. The RP (reaction point, the sender NIC) cuts its rate multiplicatively on a CNP and, in the absence of CNPs, recovers via fast recovery, then additive increase, then hyper-increase. It is AIMD driven by ECN, conceptually borrowed from QCN/DCTCP. PFC sits underneath as a last-resort lossless mechanism; a healthy design marks early enough that pause is rare.
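The CP marking curve and the RP rate updates can be written down compactly. The sketch below follows the update rules as published in the original DCQCN paper (rate cut by alpha/2 on a CNP, alpha decay when no CNP arrives, fast recovery toward the target rate, then additive increase); hyper-increase, the timers, and the byte counter are omitted. Struct and function names are illustrative, and real NIC firmware may differ in detail.

```c
#include <assert.h>

/* CP: RED-style ECN marking probability for queue depth q. */
static double ecn_mark_probability(double q, double kmin, double kmax, double pmax)
{
    assert(kmin < kmax && pmax >= 0.0 && pmax <= 1.0);
    if (q <= kmin)
        return 0.0;
    if (q > kmax)
        return 1.0;
    return pmax * (q - kmin) / (kmax - kmin);
}

/* RP state; field names loosely follow the paper's notation. */
struct dcqcn_rp {
    double rc;        /* current sending rate */
    double rt;        /* target rate remembered from before the last cut */
    double alpha;     /* congestion estimate in [0, 1] */
    double g;         /* alpha gain, 0 < g < 1 */
    double line_rate; /* upper bound for rt */
};

/* CNP received: remember the old rate, cut multiplicatively, raise alpha. */
static void rp_on_cnp(struct dcqcn_rp *s)
{
    s->rt = s->rc;
    s->rc = s->rc * (1.0 - s->alpha / 2.0);
    s->alpha = (1.0 - s->g) * s->alpha + s->g;
}

/* Alpha timer expired with no CNP: decay the congestion estimate. */
static void rp_alpha_decay(struct dcqcn_rp *s)
{
    s->alpha = (1.0 - s->g) * s->alpha;
}

/* Fast recovery step: move halfway back toward the target rate. */
static void rp_fast_recovery(struct dcqcn_rp *s)
{
    s->rc = (s->rt + s->rc) / 2.0;
}

/* Additive increase step: raise the target by r_ai, then move halfway. */
static void rp_additive_increase(struct dcqcn_rp *s, double r_ai)
{
    s->rt += r_ai;
    if (s->rt > s->line_rate)
        s->rt = s->line_rate;
    s->rc = (s->rt + s->rc) / 2.0;
}
```

Starting at line rate with alpha = 1, a single CNP halves the rate; each fast-recovery step then closes half the remaining gap to the pre-cut rate. When debugging, this is the shape to look for in per-QP rate telemetry.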

The moving parts to get right are: DSCP/PCP classification into the RDMA priority, switch ECN marking thresholds (Kmin/Kmax and the marking probability), PFC enablement and headroom on the same priority, CNP generation/handling and its own QoS, NIC rate parameters (there are on the order of 15 knobs: initial rate, alpha update, byte-counter and timer for increase, hyper-increase), and routing symmetry.

The classic misconfigurations: if ECN marks later than PFC pauses (Kmin set above the PFC trigger), PFC fires first and you get pause storms instead of graceful rate cuts. If recovery is too conservative, throughput collapses after a transient. If CNPs are not put in a high-priority, ideally lossless class, they can be delayed or dropped exactly when congestion is worst, breaking the feedback loop. If DSCP is rewritten or mismatched across hops, RDMA traffic lands in the wrong queue unprotected. In debugging I confirm packets carry the intended DSCP/ECN bits, switches mark instead of drop, NICs report CNP tx/rx and rate changes, and pause counters stay flat in steady state. A single bad threshold on one tier can dominate the whole job.

What they're listening for: A staff answer names CP/NP/RP roles, ties Kmin/Kmax ECN thresholds to the PFC trigger ordering, and flags CNP QoS as a feedback-loop failure mode.
Follow-ups
  • What happens if the PFC threshold is crossed before the ECN threshold?
  • Why must CNP traffic get its own QoS treatment?
  • How would you tune for fast ramp without oscillation?
PFC is what makes RoCEv2 lossless, yet operators fear it. Explain the failure modes PFC introduces and the mechanisms that contain them. [staff]

PFC is per-priority (per 802.1p class) link-level pause: when a receiver's ingress buffer for a priority crosses a threshold, it sends a PAUSE telling the upstream port to stop sending that class. It must assert early enough to absorb packets already in flight: the headroom has to cover roughly a round trip of cable propagation plus MAC/PHY and pause-response delay at line rate, plus any frame already being serialized. That is why headroom buffer sizing is a real design parameter.
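A back-of-the-envelope headroom estimate makes the cable-length dependence concrete. The model below is an assumption-laden sketch, not a vendor formula: it charges one round trip of propagation plus a lumped response delay at line rate, and adds two maximum-size frames (one already being serialized by the sender, one delaying the PAUSE at the receiver). The function name and parameters are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Rough per-priority PFC headroom in bytes.
 *   rate_gbps   - link rate in Gb/s
 *   cable_m     - one-way cable length in meters
 *   ns_per_m    - propagation delay per meter (about 5 ns/m in fiber)
 *   response_ns - lumped MAC/PHY and pause-reaction delay (assumed)
 *   mtu_bytes   - maximum frame size on the link */
static uint64_t pfc_headroom_bytes(uint64_t rate_gbps, uint64_t cable_m,
                                   uint64_t ns_per_m, uint64_t response_ns,
                                   uint64_t mtu_bytes)
{
    assert(rate_gbps > 0);
    /* PAUSE travels one way, then in-flight data keeps arriving one way. */
    uint64_t rtt_ns = 2 * cable_m * ns_per_m + response_ns;
    /* Gb/s times ns gives bits; divide by 8 for bytes. */
    uint64_t in_flight = rate_gbps * rtt_ns / 8;
    return in_flight + 2 * mtu_bytes;
}
```

At 400 Gb/s over 100 m of fiber, propagation alone accounts for 50,000 bytes per priority per port before any response delay, which is why long links and high speeds eat shallow switch buffers quickly.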

The failure modes are all about backpressure propagating beyond the real hotspot:
  • Head-of-line blocking: pause stops an entire priority on a link, so flows that share the class but were not causing congestion get blocked too.
  • Pause propagation / congestion spreading: a single slow receiver pauses its upstream switch, which fills and pauses its upstream, and the tree of paused links can reach far back toward unrelated senders.
  • PFC deadlock: the worst case, a cyclic buffer dependency where switch A holds buffer its upstream needs while waiting on its downstream, around a loop. This can arise from topology (e.g. an up-down routing violation, or rerouting after a link failure) and, once formed, is self-sustaining and permanently stalls the class.
  • Victim flows / PFC storm: a misbehaving or stuck NIC that never stops asserting PAUSE (or a buggy port) can freeze a class network-wide.
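The deadlock condition is a cycle in the "buffer A cannot drain until buffer B drains" graph, so checking a routing configuration for it is ordinary directed-cycle detection. The sketch below uses a depth-first search over a small adjacency matrix; `struct dep_graph` and how you would populate it from routing tables are assumptions left to the reader.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_BUFFERS 16

/* waits_on[a][b] is true when ingress buffer a cannot drain until
 * buffer b (on the next hop) drains, for one lossless priority. */
struct dep_graph {
    int n;
    bool waits_on[MAX_BUFFERS][MAX_BUFFERS];
};

/* color: 0 = unvisited, 1 = on the current DFS path, 2 = finished. */
static bool dfs_finds_cycle(const struct dep_graph *g, int u, int *color)
{
    color[u] = 1;
    for (int v = 0; v < g->n; v++) {
        if (!g->waits_on[u][v])
            continue;
        if (color[v] == 1)
            return true; /* back edge: dependency loop */
        if (color[v] == 0 && dfs_finds_cycle(g, v, color))
            return true;
    }
    color[u] = 2;
    return false;
}

/* True if the buffer dependency graph contains a cycle, i.e. a set of
 * buffers that could all be paused waiting on each other. */
static bool has_pfc_deadlock_cycle(const struct dep_graph *g)
{
    assert(g->n >= 0 && g->n <= MAX_BUFFERS);
    int color[MAX_BUFFERS] = { 0 };
    for (int u = 0; u < g->n; u++)
        if (color[u] == 0 && dfs_finds_cycle(g, u, color))
            return true;
    return false;
}
```

A cycle in this graph means deadlock is possible, not that it has occurred; it still requires the buffers around the loop to fill at the same time. Up-down routing on a Clos fabric keeps the graph acyclic, which is why reroutes that violate it are the usual trigger.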

Containment: the PFC watchdog on switches detects a queue that has been paused-but-not-draining for too long and force-drops/disables PFC on it to break a deadlock or stuck-pause storm. Disjoint priorities limit HoL blocking blast radius. Deadlock-free routing (and avoiding loops on rerouting) prevents cycles. And the strategic answer is to lean on ECN/DCQCN so the rate loop reacts before queues ever hit the PFC threshold, keeping PFC as a rare last resort. Lossy RoCE with good selective-retransmit NICs is an increasingly viable alternative that sidesteps PFC entirely.

What they're listening for: Strong answers name HoL blocking, pause propagation, and specifically cyclic-buffer-dependency deadlock, then cite the PFC watchdog and ECN-first design as containment; weak answers just say PFC is lossless and stop.
Follow-ups
  • Why does headroom buffer have to cover the cable length?
  • How can a link failure create a routing loop that deadlocks PFC?
  • Why is lossy RoCE with selective retransmit becoming attractive?
Ultra Ethernet discusses packet spraying, flexible ordering, and path-aware congestion control. What problem is this trying to solve, and what must the NIC/transport provide to make it safe? [staff]

Traditional ECMP hashes a whole flow (5-tuple) onto one path. Large AI flows can collide on the same uplink while other paths sit idle, and a single elephant flow can never use more than one path's worth of bandwidth. Packet spraying distributes packets of one message across many paths so the fabric is used evenly and congestion stops depending on hash luck.
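The difference between per-flow hashing and spraying shows up directly in the worst-loaded path. The sketch below compares the two for a given set of flow hash values; the round-robin spraying rule is a stand-in chosen for simplicity, not the entropy scheme UET specifies, and all names are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PATHS 64

static uint64_t max_of(const uint64_t *load, size_t paths)
{
    uint64_t m = 0;
    for (size_t p = 0; p < paths; p++)
        if (load[p] > m)
            m = load[p];
    return m;
}

/* ECMP: every packet of a flow follows hash % paths. */
static uint64_t ecmp_max_load(const uint32_t *flow_hash, size_t flows,
                              uint64_t pkts_per_flow, size_t paths)
{
    assert(paths >= 1 && paths <= MAX_PATHS);
    uint64_t load[MAX_PATHS] = { 0 };
    for (size_t f = 0; f < flows; f++)
        load[flow_hash[f] % paths] += pkts_per_flow;
    return max_of(load, paths);
}

/* Spraying (simplified): packet i of a flow goes to (hash + i) % paths. */
static uint64_t spray_max_load(const uint32_t *flow_hash, size_t flows,
                               uint64_t pkts_per_flow, size_t paths)
{
    assert(paths >= 1 && paths <= MAX_PATHS);
    uint64_t load[MAX_PATHS] = { 0 };
    for (size_t f = 0; f < flows; f++)
        for (uint64_t i = 0; i < pkts_per_flow; i++)
            load[(flow_hash[f] + i) % paths]++;
    return max_of(load, paths);
}
```

With four flows whose hashes are 0, 4, 8, and 1 over four paths, ECMP piles three flows onto path 0 (300 packets on the hottest path for 100-packet flows) while spraying puts 100 on every path. A handful of elephant flows gives hashing few chances to even out.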

The price is reordering. UET handles this by exposing explicit reliability/ordering modes instead of forcing one model: RUD (Reliable Unordered Delivery, the default bulk mode) reassembles reliably but lets packets arrive out of order to the semantic layer; ROD (Reliable Ordered) preserves order for things like MPI match semantics; RUDI is reliable-unordered for idempotent ops at the highest scale; UUD is unreliable unordered. So spraying is safe because the transport tracks per-message completion and only the modes that need ordering pay for it; software sees a completed message, not the packet disorder. Spraying entropy can also be pinned (fix the entropy value) when a flow genuinely needs a single path.
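Per-message completion tracking is what hides packet disorder from software, and its core is small. The sketch below tracks arrivals in a bitmap and reports completion once, when the last missing packet lands, regardless of arrival order. It assumes messages of at most 64 packets and is an illustration of the idea, not the UET wire protocol or any NIC's actual context layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-message reassembly context (up to 64 packets). */
struct msg_ctx {
    uint32_t total_pkts; /* packets in this message, 1..64 */
    uint64_t received;   /* bit i is set once packet i has arrived */
};

static void msg_init(struct msg_ctx *m, uint32_t total_pkts)
{
    assert(total_pkts >= 1 && total_pkts <= 64);
    m->total_pkts = total_pkts;
    m->received = 0;
}

/* Record the arrival of packet `seq` in any order. Returns true exactly
 * once: when this arrival completes the message. Duplicates (for example
 * a retransmit racing the original) are ignored. */
static bool msg_on_packet(struct msg_ctx *m, uint32_t seq)
{
    assert(seq < m->total_pkts);
    uint64_t bit = UINT64_C(1) << seq;
    if (m->received & bit)
        return false;
    m->received |= bit;

    uint64_t full = (m->total_pkts == 64)
                        ? UINT64_MAX
                        : (UINT64_C(1) << m->total_pkts) - 1;
    return m->received == full;
}
```

Software only ever sees the single `true` return, which corresponds to the message completion; the order in which packets 0..n-1 arrived is invisible to it. An ordered mode would additionally have to hold back completion of later messages until earlier ones finish.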

To make this work the NIC needs reassembly/reorder context per message, sequence tracking, and fast loss recovery. UET pairs spraying with packet trimming (a congested switch truncates a packet to its header and forwards the stub so the receiver learns precisely which packet was dropped, enabling fast selective retransmit instead of a timeout) and per-path congestion control: NSCC (sender-based network-signal CC using ECN/RTT/CSIG in-band telemetry) plus RCCC (receiver-credit CC for incast). Schemes like REPS recycle the entropy values of paths that ACKed cleanly. The value proposition is tail behavior for synchronized collectives at scale, not just higher average throughput.

What they're listening for: The trap is praising spraying without the ordering model. A staff answer names the RUD/ROD/RUDI modes, packet trimming for precise loss signaling, and NSCC+RCCC, not just 'load balancing'.
Follow-ups
  • Why is per-message completion the key to making spraying invisible to software?
  • How does packet trimming beat a retransmit timeout?
  • When would you pin the entropy value instead of spraying?
Scale-up fabrics such as UALink and scale-out fabrics such as Ultra Ethernet solve different parts of the AI cluster problem. Draw the boundary and explain why it matters to a NIC driver engineer. [senior]

Scale-up connects accelerators inside a tightly coupled pod or rack-scale domain with memory-semantic, low-latency, high-bandwidth behavior: load/store and atomics between accelerators over short reach, often a switched fabric (UALink targets hundreds of accelerators in a pod, competing with NVLink/NVSwitch). Scale-out connects many nodes or pods over a routable network; Ultra Ethernet targets Ethernet optimized for AI/HPC scale-out across NICs, switches, optics, and software.

The boundary matters because the semantics differ. A scale-up link cares about cache/memory ordering, accelerator-to-accelerator load/store and atomic access, and very short reach, so it can assume near-lossless behavior and skip a heavy software transport. A scale-out NIC cares about packetization, congestion control, loss recovery, virtualization/multi-tenancy, telemetry, security, and integration with Linux, libfabric/verbs, and RCCL/NCCL. Collective libraries deliberately exploit both: reduce within the scale-up domain at NVLink/UALink bandwidth, then do the smaller cross-node step over the scale-out fabric (hierarchical collectives).

For a NIC driver engineer, the two worlds meet at topology discovery, memory registration, peer DMA, failure reporting, and the collective library picking the right path. A bug in NUMA placement, PCIe peer access (ACS), CQ moderation, or device reset can silently demote a scale-up-capable transfer onto the slow scale-out path, and the application just sees mysterious bandwidth loss without an error.

What they're listening for: A strong candidate separates memory-semantic short-reach scale-up (UALink/NVLink) from routable scale-out transport (Ultra Ethernet), and connects it to hierarchical collectives and driver-visible demotion failures.
Follow-ups
  • What should a topology API expose to RCCL/NCCL?
  • Why do hierarchical collectives reduce intra-node first?
  • What reset behavior is dangerous with peer DMA?