Performance and Latency Engineering

Probes whether the candidate can measure, explain, and reduce latency without fooling themselves or damaging correctness.

[Interactive figure: choose which cache level holds the line, then load, to see the cycles added at the level that hits. L1: ~32 KB, per-core, the fast path (+1 cycle). L2: ~256 KB-1 MB, per-core, a few cycles (+4 cycles). L3: shared, MBs, tens of cycles (+12 cycles). DRAM: main memory, the cliff, ~100 cycles (+100 cycles).]

You need to measure a 200 ns fast-path function. How do you use `rdtsc`/`rdtscp` correctly, and what are the traps? (staff)

Use an invariant TSC if the platform provides one (check the invariant_tsc/constant_tsc + nonstop_tsc CPUID/flags), pin the thread, avoid migration, and measure enough iterations to amortize timestamp overhead. rdtsc is not serializing, so use lfence; rdtsc before the region on modern x86 when you need prior instructions complete, and rdtscp; lfence after when you need the timestamp ordered before later work. cpuid is stronger but much more expensive and can dominate a 200 ns measurement.

Also measure the empty harness and subtract it, report distributions rather than only mean, and convert cycles using the measured or known invariant-TSC frequency, not instantaneous turbo frequency. The invariant TSC ticks at a fixed rate regardless of P-state, so cycles-to-nanoseconds is stable, but it does NOT track the core's actual execution clock, so 'cycles' from TSC is wall-clock time, not core cycles. For core cycles use perf / CPU_CLK_UNHALTED. Disable or account for C-states, SMT interference, interrupts, page faults, and branch predictor warmup.

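#include <stdint.h>

/* Start timestamp: lfence keeps rdtsc from executing before earlier
   instructions have completed. */
static inline uint64_t tsc_start(void) {
    unsigned lo, hi;
    __asm__ __volatile__("lfence\n\trdtsc" : "=a"(lo), "=d"(hi) :: "memory");
    return ((uint64_t)hi << 32) | lo;
}

/* Stop timestamp: rdtscp waits for earlier instructions; the trailing lfence
   keeps later work from starting before the timestamp is read. */
static inline uint64_t tsc_stop(void) {
    unsigned lo, hi, aux;   /* aux receives IA32_TSC_AUX (OS-defined; unused here) */
    __asm__ __volatile__("rdtscp\n\tlfence" : "=a"(lo), "=d"(hi), "=c"(aux) :: "memory");
    return ((uint64_t)hi << 32) | lo;
}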
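
The overhead subtraction and the cycles-to-nanoseconds conversion described above can be sketched as follows. This is a minimal illustration, not part of the original answer: the helper names (min_sample, net_ticks, ticks_to_ns) are hypothetical, and the TSC frequency is assumed to have been measured or read from the platform beforehand.

```c
#include <stddef.h>
#include <stdint.h>

/* Smallest value in a set of empty-harness timings; the minimum is a common
   (assumed) choice for the fixed timestamping overhead. Returns UINT64_MAX
   when n == 0. */
static uint64_t min_sample(const uint64_t *samples, size_t n) {
    uint64_t m = UINT64_MAX;
    for (size_t i = 0; i < n; i++)
        if (samples[i] < m) m = samples[i];
    return m;
}

/* Remove the harness overhead from a raw tick delta, clamping at zero. */
static uint64_t net_ticks(uint64_t raw, uint64_t overhead) {
    return raw > overhead ? raw - overhead : 0;
}

/* Convert invariant-TSC ticks to nanoseconds using the fixed TSC frequency
   (not the current core clock). Whole seconds and the remainder are handled
   separately so the multiplication does not overflow for realistic
   frequencies (below roughly 18 GHz). */
static uint64_t ticks_to_ns(uint64_t ticks, uint64_t tsc_hz) {
    uint64_t whole = ticks / tsc_hz;
    uint64_t rem = ticks % tsc_hz;
    return whole * 1000000000ULL + (rem * 1000000000ULL) / tsc_hz;
}
```

For example, a raw delta of 636 ticks with a measured 36-tick empty harness on a 3 GHz TSC comes out at 200 ns.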
What they're listening for: A strong answer treats timestamping as part of the experiment, including serialization, migration, and overhead subtraction, and knows TSC frequency != core frequency.
Follow-ups
  • Why can `rdtscp` still be insufficient at the beginning of a region?
  • How do you detect cross-core TSC problems?
  • What would you use on Arm?
A benchmark claims p99.9 latency improved after batching RX completions. What methodology checks whether the result is real? (staff)

I would first check the load model. A closed-loop client that sends the next request only after receiving the previous response can hide stalls through coordinated omission: if the system stalls 5 ms, a closed-loop client simply does not send during the stall, records one slow sample instead of the hundreds an open-loop sender would have queued, and the tail looks artificially clean. For latency claims, use an open-loop / constant-arrival process, record intended send time and completion time, and correct or explicitly report missed intervals. Tools that do this correctly include wrk2, Vegeta, and HdrHistogram's coordinated-omission correction (record with an expected interval).
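
To make the correction concrete, here is a minimal sketch in the spirit of recording with an expected interval: when one sample is longer than the send interval, the requests that should have been sent during the stall are back-filled with linearly decreasing latencies. The sample_log type and function names are hypothetical, and this is a simplified reading of the technique rather than HdrHistogram's actual code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sink for samples: a caller-provided flat array. */
struct sample_log {
    uint64_t *values;
    size_t len;
    size_t cap;
};

static void log_push(struct sample_log *log, uint64_t v) {
    if (log->len < log->cap)
        log->values[log->len++] = v;   /* silently drops when full */
}

/* Record one measured latency, then back-fill the samples that a
   fixed-rate sender would have produced during the stall. Returns the
   number of samples added. */
static size_t record_corrected(struct sample_log *log, uint64_t value,
                               uint64_t expected_interval) {
    size_t before = log->len;
    log_push(log, value);
    if (expected_interval == 0 || value <= expected_interval)
        return log->len - before;
    for (uint64_t missing = value - expected_interval;
         missing >= expected_interval;
         missing -= expected_interval)
        log_push(log, missing);
    return log->len - before;
}
```

With a 1000 us interval, a single 5000 us stall becomes five samples (5000, 4000, 3000, 2000, 1000) instead of one, which is what moves the tail percentiles.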

Then check sample size: p99.9 needs many independent samples. 100,000 samples give only about 100 points beyond p99.9, and burst correlation makes that evidence weaker still. Report confidence intervals or repeated runs, not a single percentile.
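
The arithmetic behind that sample-size check is short enough to write down. A minimal sketch, with hypothetical function names, expressing the tail as parts per million (p99.9 leaves 1000 ppm beyond it):

```c
#include <stdint.h>

/* Expected number of samples beyond a percentile, where tail_ppm is the
   fraction of samples past it in parts per million (p99.9 -> 1000).
   Assumes n * tail_ppm fits in 64 bits. */
static uint64_t tail_samples(uint64_t n, uint64_t tail_ppm) {
    return n * tail_ppm / 1000000ULL;
}

/* Total samples needed to expect at least want_in_tail points beyond the
   percentile (rounded up). tail_ppm must be non-zero. */
static uint64_t samples_needed(uint64_t want_in_tail, uint64_t tail_ppm) {
    return (want_in_tail * 1000000ULL + tail_ppm - 1) / tail_ppm;
}
```

So 100,000 samples leave about 100 points past p99.9, and wanting 1,000 tail points means collecting about a million samples per run.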

For batching specifically, separate service time from queueing time. Batching can improve throughput and median by amortizing MMIO/cache work while increasing wait time for the first packet in a batch, so a 'p99.9 improved' claim under a closed loop is exactly the result coordinated omission would fabricate. Measure p50, p99, p99.9, max, drops, offered load, achieved load, and queue depth under steady state and incast.
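
A toy model shows why the two effects pull in opposite directions. It assumes a batch is processed only once it is full and that packets arrive at a fixed spacing; real drivers also flush on a timer, so treat the numbers as illustrative. All names are hypothetical.

```c
#include <stdint.h>

/* Amortized processing cost per packet: one fixed cost (for example an MMIO
   doorbell) shared by the batch, plus the per-packet work. batch >= 1. */
static uint64_t per_packet_cost_ns(uint64_t fixed_ns, uint64_t per_item_ns,
                                   uint64_t batch) {
    return fixed_ns / batch + per_item_ns;
}

/* Extra queueing delay seen by the first packet of a batch while it waits
   for the remaining batch - 1 packets to arrive. */
static uint64_t first_packet_wait_ns(uint64_t interarrival_ns, uint64_t batch) {
    return (batch - 1) * interarrival_ns;
}
```

With an 800 ns fixed cost and 50 ns of per-packet work, a batch of 8 cuts the amortized cost from 850 ns to 150 ns, while at 1 us packet spacing the first packet in each batch waits an extra 7 us.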

One fixed cost amortized across many operations.
What they're listening for: This probes whether they understand tail latency as a measurement discipline, not just a number printed by a benchmark. Naming wrk2/HdrHistogram and the open-loop fix is the signal.
Follow-ups
  • How would you detect coordinated omission in an existing tool?
  • What changes if packets arrive from hardware rather than a load generator?
  • When is max latency more useful than p99.9?
Walk through a `perf` workflow for a packet processing loop that regressed 8% throughput and 30% p99 latency. (senior)

Start with perf stat to see direction: cycles, instructions, IPC, branches, branch-misses, cache-references, cache-misses, LLC events if available, context switches, migrations, and page faults. Use repeated runs and the same CPU pinning. A throughput regression usually shows up in one of two ways: lower IPC (the same instructions retire more slowly, which points at stalls) or more instructions executed per packet (which points at added work).
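
One way to read the perf stat numbers is to normalize the counter deltas per packet, so that "more work" and "slower work" separate cleanly. A minimal sketch; the struct and function names are hypothetical, and the counter values would come from perf stat plus the application's own packet count.

```c
#include <stdint.h>

/* Counter deltas over one measurement window. */
struct counters {
    uint64_t cycles;
    uint64_t instructions;
    uint64_t packets;
};

/* Instructions per cycle, scaled by 1000 to stay in integer math. */
static uint64_t ipc_milli(const struct counters *c) {
    return c->instructions * 1000 / c->cycles;
}

/* Work per packet: rises when code was added to the path. */
static uint64_t instr_per_packet(const struct counters *c) {
    return c->instructions / c->packets;
}

/* Cost per packet: rises in both cases, so on its own it cannot tell
   added work from added stalls. */
static uint64_t cycles_per_packet(const struct counters *c) {
    return c->cycles / c->packets;
}
```

If cycles per packet rose 8% while instructions per packet stayed flat, IPC fell and the next step is memory or branch events; if instructions per packet rose 8% at constant IPC, look for the added code first.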

Then use perf record -g or LBR call graphs (--call-graph lbr) to find where cycles moved. If the loop is small, use perf annotate to inspect instructions and source lines. If the regression smells like memory, sample precise load/store or LLC-miss events via PEBS (perf record -e mem_load_retired.l3_miss:pp); generic event names are not always precise or available across vendors, so confirm with perf list.

Finally correlate with latency, because the 8% and the 30% may have different root causes: a throughput regression may be extra instructions or a new cache miss, while a p99 regression may be rare page faults, IRQs, lock contention, or cache-line bouncing that barely move the average. perf sched, perf c2c, ftrace, or eBPF are usually more useful than CPU flame graphs for tail spikes. I would not assume one fix addresses both numbers.

What they're listening for: A senior answer knows `perf stat` and `perf record` answer different questions, that PMU event availability is CPU-specific, and that the throughput and tail regressions may be unrelated.
Follow-ups
  • What does lower IPC mean in a polling loop?
  • How can sampling perturb a low-latency workload?
  • When would you prefer LBR call graphs?
How would you use `perf c2c` to prove false sharing in a NIC queue data structure? (senior)

perf c2c is for cache-to-cache/HITM analysis: it identifies cache lines with heavy contention and shows the offsets within the line and the call paths of the readers and writers. I would run a representative workload pinned to the suspected cores, record with call graphs, and look for lines with high remote HITM (a load that hit a modified line in another core's cache).

The proof is not just 'line is hot'. I would map the line back to the struct, show two independently updated fields living in the same 64-byte line at different offsets (perf c2c prints the offsets, which is the smoking gun), and show the contention disappears or moves after padding or ownership separation. I would also rule out true sharing, such as a lock or index both sides genuinely must touch, where padding will not help.
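
The layout side of that argument can be checked at compile time. The sketch below uses two hypothetical queue structs and assumes a 64-byte cache line: in the first, the producer's and consumer's fields share a line; in the second, C11 alignas puts each on its own line.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumed line size; matches the 64-byte line above */

/* Candidate for false sharing: two independently updated fields end up in
   the same cache line at different offsets. */
struct queue_shared {
    uint32_t head;   /* written by the producer core */
    uint32_t tail;   /* written by the consumer core */
};

/* Ownership separated: each field starts its own cache line. */
struct queue_padded {
    alignas(CACHE_LINE) uint32_t head;
    alignas(CACHE_LINE) uint32_t tail;
};

/* Index of the cache line an offset falls in, relative to the struct start. */
static size_t line_of(size_t offset) {
    return offset / CACHE_LINE;
}
```

Comparing line_of(offsetof(...)) for the two fields mirrors what the perf c2c offset columns show at run time: the same line with different offsets before the change, different lines after it.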

A typical command is perf c2c record -ag -- workload followed by perf c2c report, with symbols and debuginfo installed. On AMD, the Owned (MOESI) state means dirty lines are forwarded cache-to-cache rather than via DRAM, so the HITM accounting and source labels can differ from an Intel MESIF box; the offset evidence is what I trust regardless of vendor.

What they're listening for: The trap is confusing a hot cache line with false sharing; the candidate should prove independent ownership via the within-line offsets, not just heat.
Follow-ups
  • What if `perf c2c` is unavailable in production?
  • How do SMT siblings affect HITM interpretation?
  • What layout change would you try first?
When would you use ftrace or eBPF instead of `perf record` for a networking latency spike? (senior)

Use perf record when you need statistical CPU attribution (where are cycles spent). Use ftrace/eBPF when you need event timing, kernel path visibility, or histograms around specific events: NAPI poll duration, softirq entry/exit, IRQ handler timing, scheduler switches, page faults, syscalls, or driver functions, things that are rare and time-correlated rather than cycle-heavy.

ftrace via tracefs (/sys/kernel/tracing) is good for function/function_graph tracing and built-in latency tracers (irqsoff, wakeup), but it can produce huge output and perturb the system. eBPF/bpftrace is good for filtered probes and in-kernel aggregation, for example histograms keyed by queue id or CPU computed in the kernel so you never ship per-event records to userspace. Prefer tracepoints over kprobes when possible because tracepoints are a more stable ABI; kprobes attach to function symbols and inlining/renaming breaks them across kernel versions.

For p99 spikes, aggregate in the kernel and emit summaries, not per-packet events. Printing from probes is often the new bottleneck and will hide the very stall you are chasing.

What they're listening for: This checks tool choice and overhead awareness, especially for rare tail events.
Follow-ups
  • Give a bpftrace histogram you would write for softirq latency.
  • Why are kprobes fragile across kernel versions?
  • How do you validate tracing overhead?
Design CPU isolation for a userspace polling datapath. Which kernel parameters and affinity settings matter, and what can still interrupt you? (staff)

A typical setup reserves housekeeping cores and isolates polling cores. Boot parameters may include isolcpus= for scheduler isolation, nohz_full= for full dynticks on isolated CPUs, rcu_nocbs= to offload RCU callbacks to housekeeping cores, and irqaffinity= to set default IRQ placement. Then set NIC IRQ affinity explicitly, pin poll threads with sched_setaffinity (or taskset/cset), and keep kernel workers, timers, and application helper threads off the datapath cores.
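
The pinning step can be sketched with sched_setaffinity. To stay runnable on any Linux machine, this illustration pins to the lowest-numbered CPU the thread is already allowed on; a real datapath would pass the specific isolated CPU instead. The function names are hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* for cpu_set_t macros and sched_setaffinity */
#endif
#include <sched.h>

/* Pin the calling thread (pid 0 means "this thread") to the first CPU in
   its current affinity mask. Returns the CPU number, or -1 on failure. */
static int pin_to_first_allowed_cpu(void) {
    cpu_set_t allowed;
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (!CPU_ISSET(cpu, &allowed))
            continue;
        cpu_set_t one;
        CPU_ZERO(&one);
        CPU_SET(cpu, &one);
        return sched_setaffinity(0, sizeof(one), &one) == 0 ? cpu : -1;
    }
    return -1;
}

/* Number of CPUs the calling thread may run on, or -1 on failure. */
static int allowed_cpu_count(void) {
    cpu_set_t set;
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    return CPU_COUNT(&set);
}
```

Reading the mask back after pinning is a cheap version of the "check taskset state" habit: a count other than 1 means something else changed the affinity.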

But isolation is not absolute. nohz_full removes the periodic scheduler tick only when exactly one task is runnable on the CPU; a second runnable task brings the tick back. Even with a single runnable task, a residual 1 Hz tick used to fire on the isolated CPU for scheduler stats (modern kernels offload it to housekeeping via an unbound workqueue), and the vmstat update governed by vm.stat_interval (default 1 s) will still wake the CPU unless the interval is raised. On top of that, managed IRQs, NMIs, machine checks, SMI/firmware events, thermal throttling, TLB shootdowns (IPIs triggered by munmap/mprotect on other cores), page faults, and perf sampling can still cause jitter.

The validation is empirical: trace interrupts and scheduling on the isolated CPU, inspect /proc/interrupts deltas, check chrt/taskset state, raise vm.stat_interval, and run cyclictest or an application-level latency histogram for hours before trusting the setup. A clean cyclictest max in single-digit microseconds is the bar.

What they're listening for: A staff-level answer avoids treating boot flags as magic, names the nohz_full single-task precondition and the 1 Hz / vmstat residue, and includes verification of residual jitter sources.
Follow-ups
  • Why keep at least one housekeeping core per NUMA node?
  • How do managed MSI-X interrupts complicate affinity?
  • What does `rcu_nocbs` actually move?
Compare busy-polling, NAPI interrupts, and interrupt coalescing for low-latency Ethernet. (senior)

Interrupts save CPU when traffic is sparse but add wakeup, interrupt, softirq, and scheduling latency on every event. NAPI mitigates interrupt storms by disabling the IRQ and switching to polling under load, then re-arming when the queue drains. Interrupt coalescing reduces interrupt rate by waiting for a time or frame threshold (ethtool -C rx-usecs / rx-frames), improving throughput and CPU efficiency while adding latency that is often directly visible in p99; for a pure latency test you set rx-usecs low or zero and accept the higher interrupt rate.

Busy-polling burns a core to remove wakeup latency. In Linux sockets, SO_BUSY_POLL sets a per-socket busy-poll budget in microseconds for a blocking receive when no data is available; /proc/sys/net/core/busy_read supplies the system-wide default for socket reads, and /proc/sys/net/core/busy_poll the budget for poll and select. Kernel-bypass APIs (ef_vi, AF_XDP, DPDK) go further by polling the device queues directly in userspace with no syscall per packet.

The right answer depends on arrival pattern and CPU budget. For steady market-data-like feeds or AI-cluster hot paths, busy-polling is justified. For sparse control traffic, it wastes power and can increase system jitter. A common production design is a hybrid: busy-poll a small window, then fall back to interrupts so an idle core can sleep.
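
The hybrid pattern can be sketched on any file descriptor. This is an illustration under simplifying assumptions: the spin phase here calls poll() with a zero timeout, which is still a syscall per iteration, whereas a real datapath would spin on its ring in userspace. The function names are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <poll.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>

/* Monotonic clock in nanoseconds. */
static uint64_t mono_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* Spin for up to spin_ns checking fd without sleeping, then fall back to a
   blocking poll so an idle core can sleep.
   Returns 1 if fd is readable, 0 on timeout, -1 on error. */
static int hybrid_wait_readable(int fd, uint64_t spin_ns, int sleep_timeout_ms) {
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    uint64_t deadline = mono_ns() + spin_ns;
    do {
        int r = poll(&pfd, 1, 0);              /* non-blocking check */
        if (r != 0)
            return r > 0 ? 1 : -1;
    } while (mono_ns() < deadline);
    int r = poll(&pfd, 1, sleep_timeout_ms);   /* blocking fallback */
    return r > 0 ? 1 : r;
}
```

The spin window is the tuning knob: long enough to cover typical inter-arrival gaps on a busy feed, short enough that a quiet link does not hold the core at full power.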

What they're listening for: The candidate should present a latency/CPU/throughput tradeoff, know concrete Linux knobs, and mention the hybrid poll-then-sleep pattern.
Follow-ups
  • How would you set `ethtool -C` for a latency test?
  • Why can disabling coalescing reduce throughput?
  • How does busy polling interact with CPU frequency scaling?
Hugepages improved average throughput but worsened rare latency spikes. Explain plausible causes and how you would test them. (senior)

Hugepages reduce TLB pressure by covering more memory per TLB entry (2 MiB or 1 GiB vs 4 KiB), which helps large packet buffers, flow tables, and rings. But Transparent Huge Pages (THP) can introduce compaction, background khugepaged scanning, and page promotion/demotion that appears as rare latency spikes. Explicit HugeTLB pages avoid that background machinery but require up-front reservation, NUMA-aware allocation, and operational discipline.

I would compare 4 KiB pages, explicit 2 MiB or 1 GiB hugetlb pages, and THP modes (always, madvise, never) with page faults and compaction counters monitored. Check perf stat dTLB/iTLB-load-miss events if available, /proc/vmstat (compact_stall, thp_*), /proc/meminfo, and application histograms. Pin memory with mlockall, prefault buffers at startup so faults do not happen on the hot path, and ensure the hugepage pool is allocated on the NIC-local NUMA node.

For a latency-sensitive datapath I would usually set THP to never (or madvise and not advise the hot regions) and use reserved hugetlb pages instead, so allocation and placement are deterministic. The key is to avoid changing allocator behavior, NUMA placement, and TLB reach all at once without observability.
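
A minimal sketch of the "reserve, prefault, pin" startup step follows. It tries an explicit hugetlb mapping and falls back to normal pages so it runs on machines with no hugepage pool; the fallback, the 2 MiB size, and the function name are assumptions for illustration, and NUMA placement is deliberately left out.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* for MAP_ANONYMOUS and MAP_HUGETLB */
#endif
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

#define HUGE_2M (2UL * 1024 * 1024)

/* Map *len bytes (rounded up to a 2 MiB multiple and written back), touch
   every page at startup, and try to lock the memory. *got_hugetlb reports
   whether the explicit hugetlb mapping succeeded. Returns NULL on failure. */
static void *alloc_prefaulted(size_t *len, int *got_hugetlb) {
    *len = (*len + HUGE_2M - 1) & ~(HUGE_2M - 1);
    void *p = mmap(NULL, *len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    *got_hugetlb = (p != MAP_FAILED);
    if (p == MAP_FAILED) {
        /* No reserved pool available: fall back to regular pages. */
        p = mmap(NULL, *len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return NULL;
    }
    memset(p, 0, *len);      /* prefault: take the page faults now */
    (void)mlock(p, *len);    /* best effort; may fail under RLIMIT_MEMLOCK */
    return p;
}
```

In production the fallback would more likely be a hard startup failure, since silently running on 4 KiB pages changes the TLB behavior being tested.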

What they're listening for: This probes whether they know hugepages are not a free win for tail latency, especially THP, and that the fix is usually reserved hugetlb + THP off + prefault.
Follow-ups
  • When would 1 GiB pages be risky?
  • How does first-touch affect hugepage placement?
  • What is the difference between THP and hugetlbfs operationally?
A flame graph shows most samples in packet parsing, but p99 spikes are caused by occasional scheduler activity. How do you avoid drawing the wrong conclusion? (senior)

A CPU flame graph is sample-weighted by on-CPU time, so it is excellent for hot code but poor at explaining rare off-CPU stalls unless you collect the right data. A function consuming 60% of cycles may have nothing to do with a 200 us p99 spike caused by migration, page fault, IRQ, or lock wait, because the stall is off-CPU and a CPU profiler does not sample a sleeping thread.

I would build separate views: an on-CPU flame graph for steady-state cost, an off-CPU flame graph (scheduler tracing / sched_switch with stacks) for stalls, and a latency-triggered trace around outliers. For packet paths, correlate each outlier with CPU id, queue id, IRQ/softirq activity, context switches, page faults, and ring occupancy.
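
One application-level way to get a latency-triggered trace is a small flight recorder: keep the last few events in a ring and copy them out only when a request exceeds the threshold. The sketch below is an assumption about how one might structure it, with hypothetical names and a deliberately tiny ring.

```c
#include <stddef.h>
#include <stdint.h>

#define RING_SLOTS 8   /* power of two so the index can be masked */

/* One recorded event: what happened, where, and when. */
struct event {
    uint64_t t_ns;
    uint32_t cpu;
    uint32_t kind;     /* application-defined: IRQ seen, ring depth, ... */
};

struct recorder {
    struct event ring[RING_SLOTS];
    uint64_t next;     /* total events recorded so far */
};

/* Overwrite the oldest slot; cheap enough to leave on in the hot path. */
static void rec_event(struct recorder *r, struct event e) {
    r->ring[r->next++ & (RING_SLOTS - 1)] = e;
}

/* If the request was an outlier, copy the retained events oldest-first into
   out (room for RING_SLOTS entries) and return how many; otherwise 0. */
static size_t snapshot_if_outlier(const struct recorder *r, uint64_t latency_ns,
                                  uint64_t threshold_ns, struct event *out) {
    if (latency_ns <= threshold_ns)
        return 0;
    uint64_t n = r->next < RING_SLOTS ? r->next : RING_SLOTS;
    for (uint64_t i = 0; i < n; i++)
        out[i] = r->ring[(r->next - n + i) & (RING_SLOTS - 1)];
    return (size_t)n;
}
```

Only outliers pay the copy cost, so the recorder explains the tail without perturbing the requests that make up the median.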

The rule is to align the tool with the question: averages and CPU samples optimize throughput; event timelines, off-CPU analysis, and histograms explain tail latency. Optimizing the hottest function here would burn effort and not move p99 at all.

What they're listening for: This checks whether they can resist optimizing the hottest function when the business problem is an off-CPU tail, and whether they know to reach for off-CPU/scheduler tracing.
Follow-ups
  • How would you trigger tracing only for requests above 50 us?
  • What does an off-CPU flame graph show?
  • How can frame-pointer omission affect flame graphs?
C-states and frequency scaling are usually invisible, but they wreck tail latency on a polling datapath. Explain the mechanisms and exactly how you would lock the machine down. (staff)

There are two separate effects. C-states are idle states: a core that goes idle (via mwait/hlt) powers down, and waking from a deep C-state (C3/C6) costs anywhere from a few microseconds to over a hundred depending on the platform; the advertised figure is the latency value in /sys/devices/system/cpu/cpu*/cpuidle/state*/latency. A busy-poll loop normally never idles, but any brief gap (a syscall, a blocking call, a lull in traffic) can let the core nap, and the next packet eats the wake-up latency as a p99/p99.9 spike. P-states are the orthogonal axis: the running frequency ramps up under load, so the first packets after an idle window run at a low frequency until the governor and turbo logic respond, adding latency at exactly the wrong moment.

To lock it down:

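  • Cap idle depth: boot with intel_idle.max_cstate=1 (or processor.max_cstate=1 when the ACPI idle driver is in use); on AMD, limit C-states through the equivalent driver or BIOS settings. At runtime, hold /dev/cpu_dma_latency open after writing 0 (PM QoS) so the kernel will not enter C-states whose exit latency exceeds the budget. That request applies system-wide and lasts only while the file descriptor stays open; for a per-core limit, use /sys/devices/system/cpu/cpuN/power/pm_qos_resume_latency_us on the isolated cores.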
  • Pin frequency: set the performance governor, and on intel_pstate set min_perf_pct=100 (or disable turbo and pin max=min) so the core does not downclock. On AMD, set the governor to performance and disable cppc/cstate idling for the isolated cores.
  • Disable SMT on the datapath cores, disable deep package C-states in BIOS, and disable thermal/power capping that can throttle under sustained load.
  • Verify with turbostat (watch Busy%, Bzy_MHz, CPU%c1/c6, PkgWatt) and cpupower idle-info; the isolated cores should sit at ~100% C0 at a flat max frequency.

The tradeoff is power and heat: holding C0 at max frequency on idle cores burns watts and can reduce all-core turbo headroom for the rest of the socket. That is the deal you make for deterministic tail latency.
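
The /dev/cpu_dma_latency request is easy to get subtly wrong, because it is dropped the moment the file is closed. A minimal sketch, with a hypothetical function name; it usually needs root, so it reports failure rather than assuming success.

```c
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Ask PM QoS to keep CPU wake-up latency at or below max_latency_us by
   writing a binary 32-bit value. The request holds only while the returned
   descriptor stays open, so the caller keeps it for the life of the
   datapath. Returns the descriptor, or -1 on failure. */
static int hold_cpu_dma_latency(int32_t max_latency_us) {
    int fd = open("/dev/cpu_dma_latency", O_WRONLY);
    if (fd < 0)
        return -1;
    if (write(fd, &max_latency_us, sizeof(max_latency_us)) !=
        (ssize_t)sizeof(max_latency_us)) {
        close(fd);
        return -1;
    }
    return fd;
}
```

A common mistake is echo 0 > /dev/cpu_dma_latency from a shell: the shell closes the file immediately and the constraint disappears, which is one reason to verify with turbostat afterwards.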

What they're listening for: Staff-level: they separate C-state wake latency from P-state ramp, name /dev/cpu_dma_latency and the pstate knobs, and verify with turbostat rather than assuming the BIOS did it.
Follow-ups
  • Why does /dev/cpu_dma_latency=0 differ from intel_idle.max_cstate=1?
  • How does disabling C-states affect the rest of the socket's turbo budget?
  • What does turbostat show on a correctly pinned core?
Write a small open-loop latency harness in C: it must send at a fixed rate and avoid coordinated omission. Walk through why each part matters. (staff)

The core idea: the send schedule is driven by a fixed clock, not by when the previous reply arrived. Each request has an intended send time; if the system stalls and you fall behind, you record latency from the intended time, not from when you actually got around to sending. That single choice is what defends against coordinated omission, the same correction HdrHistogram and wrk2 apply.

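/* Sketch: now_ns(), busy_wait_until(), send_request(), wait_for_reply()
   (returns the completion timestamp in ns), target_rate, n, and the
   histogram h are assumed to be defined elsewhere. */
uint64_t interval_ns = 1000000000ULL / target_rate;
uint64_t next = now_ns();
for (uint64_t i = 0; i < n; i++) {
    next += interval_ns;              /* intended send time, fixed cadence */
    uint64_t t = now_ns();
    if (t < next) busy_wait_until(next);   /* if already late, send at once */
    uint64_t sent = now_ns();         /* actual send time, for comparison only */
    (void)sent;
    send_request(i);
    uint64_t done = wait_for_reply(i);
    /* measure from the INTENDED time, not from sent */
    hdr_record_value(h, (int64_t)(done - next));
}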

Why each part matters:

  • next += interval (not next = now + interval) keeps the cadence absolute, so a stall does not slide the whole schedule and erase the backlog.
  • Recording done - next (intended-to-completion) charges the request for the queueing delay an open-loop client would really have seen. Recording done - sent would reintroduce coordinated omission.
  • Busy-wait, not sleep(), because sleep/nanosleep granularity and wakeup latency are themselves tens of microseconds and would dominate a low-latency measurement.
  • One outstanding request here is for clarity; a real harness keeps many in flight (separate send and receive threads, or async) so the offered load is independent of service time, which is the whole point of open-loop.
  • Feed an HdrHistogram (or equivalent) so you keep full distribution and can report p99/p999/max, and ideally use its built-in coordinated-omission correction as a cross-check.
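
The timing helpers that the loop relies on are not shown in the answer; one plausible shape for them, assuming CLOCK_MONOTONIC is an acceptable time source, is sketched below. The intended_send_ns helper is a hypothetical addition that states the absolute schedule in closed form.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Monotonic time in nanoseconds; unaffected by wall-clock adjustments. */
static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* Spin until the deadline; no sleeping, so no scheduler wake-up latency. */
static void busy_wait_until(uint64_t deadline_ns) {
    while (now_ns() < deadline_ns) {
        /* spin */
    }
}

/* Intended send time of request i (0-based) on an absolute schedule that
   starts at start_ns; equivalent to repeatedly doing next += interval. */
static uint64_t intended_send_ns(uint64_t start_ns, uint64_t interval_ns,
                                 uint64_t i) {
    return start_ns + (i + 1) * interval_ns;
}
```

Because the schedule depends only on the start time and the index, a stall on request 10 does not move the intended time of request 11, which is the property that keeps the backlog visible in the histogram.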
What they're listening for: They must drive sends from an absolute clock and measure from the intended time. Measuring from actual-send, or pacing with sleep, are the two dead giveaways of a flawed harness.
Follow-ups
  • Why keep many requests in flight instead of one?
  • What does HdrHistogram's coordinated-omission correction do internally?
  • How would you separate network RTT from server service time in these numbers?
How do RSS, RPS/RFS, and aRFS steer received packets to cores, and how do you configure them so a flow lands on the CPU where its application thread runs? (senior)

RSS (Receive Side Scaling) is hardware: the NIC hashes packet headers (a Toeplitz hash over the 4-tuple) and uses an indirection table to pick an RX queue, each queue tied to an MSI-X vector and thus an IRQ. You shape it with ethtool -X (indirection table / hash key) and ethtool -N (flow-type and field steering). RSS spreads load across queues but is oblivious to where the consuming thread actually runs.
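
The hash-then-indirect step can be sketched in a few lines. This is a generic bitwise Toeplitz implementation with a caller-supplied key, followed by an indirection-table lookup on the low bits of the hash; the 128-entry table size and the function names are assumptions, and real NICs differ in table size and in which header fields feed the hash.

```c
#include <stddef.h>
#include <stdint.h>

/* Toeplitz hash: for every set bit of the input (most significant bit
   first), XOR in the 32-bit window of the key that starts at that bit
   position. Requires key_len >= 4; key bits past the end read as zero. */
static uint32_t toeplitz_hash(const uint8_t *key, size_t key_len,
                              const uint8_t *data, size_t data_len) {
    uint32_t window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
                      ((uint32_t)key[2] << 8) | (uint32_t)key[3];
    uint32_t hash = 0;
    for (size_t i = 0; i < data_len; i++) {
        for (int bit = 7; bit >= 0; bit--) {
            if (data[i] & (1u << bit))
                hash ^= window;
            window <<= 1;                       /* slide the key window */
            if (i + 4 < key_len && (key[i + 4] & (1u << bit)))
                window |= 1u;
        }
    }
    return hash;
}

#define RETA_SIZE 128   /* assumed indirection table size, power of two */

/* Pick the RX queue: low-order hash bits index the indirection table. */
static unsigned rss_queue(const uint8_t reta[RETA_SIZE], uint32_t hash) {
    return reta[hash & (RETA_SIZE - 1)];
}
```

Nothing in this path knows where the consuming thread runs, which is the gap RFS and aRFS exist to close.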

RPS is the software equivalent done in the kernel after the IRQ, redistributing packets to CPUs via a per-queue rps_cpus mask, useful when the NIC has fewer queues than cores. RFS (Receive Flow Steering) makes RPS application-aware: when a thread calls recvmsg, the kernel records the CPU in a flow table keyed by the connection hash, and subsequent packets for that flow are steered to that CPU, raising data-cache hit rate by processing the packet where it will be consumed. aRFS (Accelerated RFS) pushes that decision into the hardware: a driver implementing ndo_rx_flow_steer programs the NIC to deliver the flow's packets to the RX queue whose IRQ affinity maps to the consuming CPU, so you get RFS locality with no software redistribution.

To get a flow onto its application's core: set RX-queue IRQ affinity to the target cores, pin the application threads to those same cores, and enable aRFS (ethtool -K dev ntuple on, then size the flow tables with net.core.rps_sock_flow_entries and each RX queue's rps_flow_cnt). aRFS deduces the CPU-to-queue map from IRQ affinity, so correct IRQ pinning is the prerequisite. For a kernel-bypass datapath you skip all of this and steer flows to userspace queues directly. The goal in every case is the same: NIC, IRQ, and consumer thread on the same core/NUMA node so the packet, its descriptor, and the app land in one cache.

What they're listening for: Senior: they distinguish hardware RSS from software RPS/RFS, know aRFS needs IRQ affinity to be right first, and tie it back to cache locality (consume where it lands).
Follow-ups
  • Why can RSS alone hurt cache locality even though it balances load?
  • What breaks aRFS for encapsulated or encrypted traffic the NIC cannot parse?
  • How does flow steering interact with the NIC-local NUMA node?
Give a bpftrace one-liner (or short script) that produces a latency histogram for an event in the network stack, and explain why in-kernel histogramming beats logging each event. (senior)

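Measure how long each NET_RX softirq run takes (the context in which NAPI polls execute) with a histogram computed in the kernel, emitting only the summary. The napi:napi_poll tracepoint fires once per poll, after the poll completes, so it cannot time a poll on its own; the softirq entry/exit tracepoints give a real start/stop pair:

bpftrace -e '
tracepoint:irq:softirq_entry /args->vec == 3/ { @start[cpu] = nsecs; }
tracepoint:irq:softirq_exit /args->vec == 3 && @start[cpu]/ {
    @ns = hist(nsecs - @start[cpu]);
    delete(@start[cpu]);
}'

Vector 3 is NET_RX_SOFTIRQ. To time a single function instead, use an entry/return pair (kprobe/kretprobe on net_rx_action or a driver's poll routine, if it is probeable on your kernel), storing the start timestamp keyed by thread id and recording hist(delta) on return; static functions such as napi_poll may be inlined and then cannot be probed. hist() buckets into power-of-two ranges in kernel memory; you read the map at the end. To attribute spikes, key the map by something useful, for example @[cpu] = hist(delta) or by queue id, so you see which CPU or queue owns the tail.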

Why in-kernel aggregation wins: logging every event ships a record to userspace per packet, which at millions of packets per second is enormous overhead, perturbs the very latency you are measuring (observer effect), can drop records under load (losing exactly the rare tail samples you care about), and serializes on the ring buffer. Computing the histogram in the kernel touches a few cache lines per event and transfers only the buckets, so it scales to line rate and is far less invasive. The rule: aggregate in the kernel, summarize to userspace, never per-event-print on a hot path.
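
What the kernel-side aggregation keeps per event is tiny. The sketch below is a userspace C illustration of a power-of-two histogram in the spirit of hist(); the exact bucket boundaries are an assumption and are not claimed to match bpftrace's layout, and the names are hypothetical.

```c
#include <stdint.h>

#define NBUCKETS 65   /* bucket 0 for the value 0, then one per bit position */

struct log2_hist {
    uint64_t count[NBUCKETS];
};

/* Bucket index: 0 for v == 0, otherwise k such that 2^(k-1) <= v < 2^k. */
static unsigned log2_bucket(uint64_t v) {
    unsigned b = 0;
    while (v) {
        b++;
        v >>= 1;
    }
    return b;
}

/* Recording an event is one index computation and one counter increment;
   no per-event record ever leaves the histogram. */
static void log2_record(struct log2_hist *h, uint64_t v) {
    h->count[log2_bucket(v)]++;
}
```

The whole distribution fits in 65 counters regardless of how many events were seen, which is why only the summary needs to cross into userspace.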

What they're listening for: They should produce a working hist() probe, key it by cpu/queue to find the tail, and articulate the observer effect and record-drop argument for in-kernel aggregation.
Follow-ups
  • How would you trace only events above a threshold to catch outlier stacks?
  • What overhead does a kretprobe add versus a tracepoint?
  • Why might you prefer a fixed-bucket histogram over hist() in production?