Interactive · latency shaving · thinking style

Where the microseconds go

Every low-latency trick on this team is the same move: delete work the result never depended on. Toggle each technique below and watch the nanoseconds drop — then learn to spot them yourself.

The senior habit: walk the path, question every stage

say this framing in the room

You don't memorize which acronym saves time — you derive it. Walk a packet from the application to the wire and, at every stage, ask one question: “Is this work the wire actually needs, or is it generality/safety tax I can remove on this hot path?”

What does the wire truly need? A framed packet, DMA'd to the NIC, serialized by the PHY. That's it. Everything else on the path is overhead we tolerate for safety or generality. Each piece is a candidate to cut.
The send() syscall crosses into the kernel, a mode switch the data itself doesn't need. Kernel bypass: run the stack in userspace and talk to the NIC ring directly.
The kernel copies my buffer into an sk_buff and runs a general-purpose stack. Zero-copy + lean userspace stack (Onload/ef_vi): the bytes are sent from where they already are.
After I ring the doorbell, the NIC reads the descriptor back across PCIe, a full round-trip just to learn what I already knew. TX_PUSH: write the descriptor + data inline with the doorbell, killing the read.
The NIC waits for the whole frame to cross PCIe before it starts transmitting. CTPIO cut-through: start streaming onto the wire as the bytes arrive.
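The copy-versus-zero-copy distinction can be seen in miniature with ordinary Python buffers. This is only an analogy, not NIC code: the `payload` buffer and its contents are made up for illustration, a `bytes(...)` slice stands in for the kernel copying into its own buffer, and a `memoryview` stands in for sending the bytes from where they already are.

```python
# Analogy only: a Python buffer standing in for the application's send buffer.
payload = bytearray(b"ORDER 42 BUY 100")  # hypothetical message

# Copying path: slicing and wrapping in bytes() allocates new memory and
# copies the data, much as the kernel copies a user buffer into its own.
copied = bytes(payload[6:8])

# Zero-copy path: a memoryview slice is a window onto the same memory.
view = memoryview(payload)[6:8]

# Change one byte of the original buffer in place ("42" becomes "43").
payload[7] = ord("3")

copy_still_old = copied == b"42"      # the copy is a stale snapshot
view_sees_new = bytes(view) == b"43"  # the view reads the live bytes
```

The view costs no extra memory traffic, but it also means the buffer must not be reused until the consumer is done with it, which is the usual price of zero-copy.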

Now watch each of those cuts happen.

Demo 1 · The transmit path

The baseline is the ordinary kernel send path. Turn the techniques on — in the order a latency engineer would reach for them — and watch the bar shrink. (Nanoseconds are illustrative — the relative shape is the point, not the absolute values.)

Baseline: 1230 ns

send() syscall + user/kernel mode switch: 90 ns

kernel TCP/IP + sk_buff alloc: 220 ns

copy payload into kernel buffer: 80 ns

softirq / scheduler wakeup: 110 ns

driver writes TX descriptor to ring: 40 ns

NIC reads descriptor back over PCIe: 300 ns

doorbell MMIO write: 60 ns

DMA payload across PCIe, then start TX: 250 ns

SerDes / PHY puts bits on the wire: 80 ns

The shape to internalize: kernel bypass removes the biggest chunk (it's pure host-side tax), but the hardware tricks (TX_PUSH killing the PCIe descriptor read, and CTPIO overlapping DMA with transmit) are what take you from “fast software” to “Solarflare fast.” This is exactly the descriptor-ring datapath, costed.
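The toggling in Demo 1 can be approximated with a small cost model. The stage names and nanosecond values are the illustrative ones from the baseline; the mapping from technique to removed stages (`REMOVES`) is an assumption inferred from the prose, not something the demo states. CTPIO is left out because it overlaps part of a stage rather than deleting one, and the page gives no figure for that overlap.

```python
# Stage costs in nanoseconds, copied from the Demo 1 baseline (illustrative).
TX_STAGES = {
    "send() syscall + user/kernel mode switch": 90,
    "kernel TCP/IP + sk_buff alloc": 220,
    "copy payload into kernel buffer": 80,
    "softirq / scheduler wakeup": 110,
    "driver writes TX descriptor to ring": 40,
    "NIC reads descriptor back over PCIe": 300,
    "doorbell MMIO write": 60,
    "DMA payload across PCIe, then start TX": 250,
    "SerDes / PHY puts bits on the wire": 80,
}

# ASSUMPTION: which stages each technique removes outright. A real userspace
# stack still costs something; this model simply drops the kernel-side stages.
REMOVES = {
    "kernel_bypass": {
        "send() syscall + user/kernel mode switch",
        "kernel TCP/IP + sk_buff alloc",
        "copy payload into kernel buffer",
        "softirq / scheduler wakeup",
    },
    "tx_push": {"NIC reads descriptor back over PCIe"},
}


def tx_latency_ns(enabled=()):
    """Sum the stages that survive after the enabled techniques are applied."""
    removed = set()
    for technique in enabled:
        removed |= REMOVES[technique]
    return sum(ns for stage, ns in TX_STAGES.items() if stage not in removed)


baseline = tx_latency_ns()                                 # 1230
bypass_only = tx_latency_ns(["kernel_bypass"])             # 730
bypass_and_push = tx_latency_ns(["kernel_bypass", "tx_push"])  # 430
```

Under these assumptions kernel bypass saves 500 ns and TX_PUSH a further 300 ns, which matches the shape described above: the software cut is the largest single saving, and the hardware cut removes most of what is left.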

Demo 2 · CTPIO cut-through, frame by frame

Why does cut-through matter so much? Press play. The top lane waits for the entire frame to cross PCIe before the first bit hits the wire; the bottom lane (CTPIO) starts transmitting as soon as bytes arrive — so the packet is leaving while it's still arriving.

Store-and-forward — wait for the last byte, then transmit

First bit on the wire at t = 15 — only after the full frame has crossed PCIe.

CTPIO cut-through — start transmitting as bytes arrive

First bit on the wire at t = 2 — the NIC streams onto the wire while the rest of the frame is still arriving.

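With these illustrative values, cut-through puts the first bit on the wire 13 time units earlier (t = 2 instead of t = 15), or about 7.5× sooner.

Legend: not yet arrived · arrived over PCIe · transmitting on the wire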
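The two lanes reduce to one line of arithmetic each. The sketch below assumes a frame split into 15 equal chunks with one chunk crossing PCIe per tick, and a cut-through threshold of 2 chunks; those parameters are chosen to reproduce the t = 15 and t = 2 figures above and are not taken from any NIC specification.

```python
def first_bit_tick(frame_chunks, cut_through_threshold=None):
    """Tick at which the NIC can put the first bit on the wire.

    ASSUMPTION: one chunk crosses PCIe per tick, so chunk k arrives at tick k.
    """
    if cut_through_threshold is None:
        # Store-and-forward: wait until the last chunk has arrived.
        return frame_chunks
    # Cut-through: start once the threshold is buffered (or the frame ends).
    return min(cut_through_threshold, frame_chunks)


store_and_forward = first_bit_tick(15)                       # 15
cut_through = first_bit_tick(15, cut_through_threshold=2)    # 2
ticks_saved = store_and_forward - cut_through                # 13
```

Note that the saving grows with frame size for store-and-forward, while the cut-through start time stays fixed at the threshold.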

The catch a senior names unprompted: cut-through means you start sending before you've seen the whole frame, so if the DMA under-runs or the frame turns out bad, you've already committed bytes to the wire. The NIC has to handle that (e.g. by deliberately corrupting the FCS so the receiver drops the frame). Latency wins always cost you something; know the price.
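Why a corrupted FCS makes the receiver drop the frame can be shown with the standard CRC-32, which `zlib.crc32` computes. This is a minimal sketch: the helper names are hypothetical, the frame body is dummy data, and inverting the checksum is just one convenient way to guarantee a mismatch. How a real NIC poisons the FCS is not specified here.

```python
import struct
import zlib


def append_fcs(body: bytes, poison: bool = False) -> bytes:
    """Append a 4-byte CRC-32 trailer, optionally a deliberately wrong one."""
    crc = zlib.crc32(body) & 0xFFFFFFFF
    if poison:
        # ASSUMPTION: inverting every bit is an illustrative poison scheme.
        crc ^= 0xFFFFFFFF
    return body + struct.pack("<I", crc)


def receiver_accepts(wire_frame: bytes) -> bool:
    """Recompute the CRC over the body and compare it with the trailer."""
    body, trailer = wire_frame[:-4], wire_frame[-4:]
    (received_crc,) = struct.unpack("<I", trailer)
    return received_crc == (zlib.crc32(body) & 0xFFFFFFFF)


body = bytes(range(60))                       # dummy 60-byte frame body
good_frame = append_fcs(body)
aborted_frame = append_fcs(body, poison=True)

good_ok = receiver_accepts(good_frame)        # True
aborted_ok = receiver_accepts(aborted_frame)  # False
```

The aborted frame still consumed wire time, which is the price the paragraph above refers to: the receiver discards it, but the bytes were already sent.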

Demo 3 · The receive path — interrupt vs. busy-poll

Receiving has its own tax: the interrupt. An interrupt means the CPU has to be told a packet arrived, take the IRQ, schedule softirq/NAPI, and wake your sleeping thread. If a core is already spinning on the ring, all of that vanishes — you trade a burned CPU core for latency.
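A toy model of the polling side, assuming a single producer and a single consumer and ignoring real concerns such as memory ordering and DMA. `RxRing`, `nic_write`, and `spin_until_packet` are hypothetical names for illustration; the point is only that the consumer learns about a packet by looking at the ring, with no interrupt or wakeup in between.

```python
class RxRing:
    """Toy receive ring: a NIC stand-in fills slots, the application polls them."""

    def __init__(self, size):
        self.slots = [None] * size  # None marks a slot that is not filled yet
        self.head = 0               # next slot the consumer will read
        self.tail = 0               # next slot the producer will fill

    def nic_write(self, packet):
        """Producer side: place a packet in the next free slot."""
        idx = self.tail % len(self.slots)
        if self.slots[idx] is not None:
            raise BufferError("ring full: consumer is not keeping up")
        self.slots[idx] = packet
        self.tail += 1

    def poll(self):
        """Consumer side: non-blocking check of the next slot."""
        idx = self.head % len(self.slots)
        packet = self.slots[idx]
        if packet is None:
            return None             # nothing yet; the caller decides to spin
        self.slots[idx] = None      # hand the slot back to the producer
        self.head += 1
        return packet


def spin_until_packet(ring, max_spins):
    """Busy-poll: keep checking the ring instead of sleeping until woken."""
    for spins in range(1, max_spins + 1):
        packet = ring.poll()
        if packet is not None:
            return packet, spins
    return None, max_spins


ring = RxRing(size=4)
empty_result = ring.poll()           # None: the ring starts empty
ring.nic_write(b"quote-1")
packet, spins = spin_until_packet(ring, max_spins=1000)
```

Every call to `poll` that returns `None` is a wasted check, which is where the burned core comes from.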

Baseline: 890 ns

NIC DMAs the packet into host memory: 200 ns

raise MSI-X interrupt: 140 ns

IRQ entry + NAPI schedule: 170 ns

scheduler wakes the blocked app: 260 ns

deliver to app (copy / syscall return): 90 ns

app reads the descriptor from the ring: 30 ns

The tradeoff to state out loud: busy-polling isn't free. It pins a core at 100% and burns power. You do it on the hot path where a microsecond is worth a core; you don't do it for a bulk file transfer. Senior answers always name what the optimization costs, not just what it saves.
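The receive-side saving can be costed the same way as the transmit side. The stage values are the illustrative Demo 3 numbers; treating exactly three stages as the "interrupt path" that busy-polling removes is an assumption based on the description above (interrupt raised, IRQ and NAPI handling, scheduler wakeup).

```python
# Stage costs in nanoseconds, copied from the Demo 3 baseline (illustrative).
RX_STAGES = {
    "NIC DMAs the packet into host memory": 200,
    "raise MSI-X interrupt": 140,
    "IRQ entry + NAPI schedule": 170,
    "scheduler wakes the blocked app": 260,
    "deliver to app (copy / syscall return)": 90,
    "app reads the descriptor from the ring": 30,
}

# ASSUMPTION: the stages that disappear when a core is already spinning.
INTERRUPT_PATH = {
    "raise MSI-X interrupt",
    "IRQ entry + NAPI schedule",
    "scheduler wakes the blocked app",
}


def rx_latency_ns(busy_poll=False):
    """Total receive latency, with or without the interrupt path."""
    skipped = INTERRUPT_PATH if busy_poll else set()
    return sum(ns for stage, ns in RX_STAGES.items() if stage not in skipped)


interrupt_driven = rx_latency_ns()           # 890
busy_polled = rx_latency_ns(busy_poll=True)  # 320
saved_ns = interrupt_driven - busy_polled    # 570
```

The model says nothing about the cost side of the trade: the 570 ns is bought with a core that does no other work.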

The one-line framework

At every stage on a hot path, ask: “Is this work the result depends on, or is it generality/safety tax I can remove here?” Kernel bypass, TX_PUSH, CTPIO, busy-poll, zero-copy, cache-line padding: every one of them is an answer to that single question. If you can derive the technique from the question instead of reciting its name, you're interviewing like a senior.
