Every low-latency trick on this team is the same move: delete work the result never depended on. Toggle each technique below and watch the nanoseconds drop, then learn to spot the cuts yourself.
The senior habit: walk the path, question every stage
say this framing in the room
You don't memorize which acronym saves time; you derive it. Walk a packet from the application to the wire and, at every stage, ask one question: "Is this work the wire actually needs, or is it generality/safety tax I can remove on this hot path?"
What does the wire truly need? A framed packet, DMA'd to the NIC, serialized by the PHY. That's it. Everything else on the path is overhead we tolerate for safety or generality. Each piece is a candidate to cut.
The send() syscall crosses into the kernel, a mode switch the data itself doesn't need. Kernel bypass: run the stack in userspace and talk to the NIC ring directly.
The kernel copies my buffer into an sk_buff and runs a general-purpose stack. Zero-copy + lean userspace stack (Onload/ef_vi): the bytes are sent from where they already are.
After I ring the doorbell, the NIC reads the descriptor back across PCIe, a full round-trip just to learn what I already knew. TX_PUSH: write the descriptor + data inline with the doorbell, killing the read.
The NIC waits for the whole frame to cross PCIe before it starts transmitting. CTPIO cut-through: start streaming onto the wire as the bytes arrive.
Now watch each of those cuts happen.
Demo 1 · The transmit path
The baseline is the ordinary kernel send path. Turn the techniques on, in the order a latency engineer would reach for them, and watch the bar shrink. (Nanoseconds are illustrative; the relative shape is the point, not the absolute values.)
Baseline total: 1230 ns
send() syscall + user/kernel mode switch: 90 ns
kernel TCP/IP + sk_buff alloc: 220 ns
copy payload into kernel buffer: 80 ns
softirq / scheduler wakeup: 110 ns
driver writes TX descriptor to ring: 40 ns
NIC reads descriptor back over PCIe: 300 ns
doorbell MMIO write: 60 ns
DMA payload across PCIe, then start TX: 250 ns
SerDes / PHY puts bits on the wire: 80 ns
The shape to internalize: kernel bypass removes the biggest chunk (it's pure host-side tax), but the hardware tricks (TX_PUSH killing the PCIe descriptor read, and CTPIO overlapping DMA with transmit) are what take you from "fast software" to "Solarflare fast." This is exactly the descriptor-ring datapath, costed.
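The arithmetic behind the bar can be sketched in a few lines. The stage costs are the illustrative figures from the baseline above; the mapping from each technique to the stages it removes is an assumption made for this sketch (the demo does not spell it out), and CTPIO is modelled crudely as hiding the whole DMA wait.

```python
# Stage costs copied from the Demo 1 baseline (illustrative nanoseconds).
TX_STAGES = {
    "send() syscall + user/kernel mode switch": 90,
    "kernel TCP/IP + sk_buff alloc": 220,
    "copy payload into kernel buffer": 80,
    "softirq / scheduler wakeup": 110,
    "driver writes TX descriptor to ring": 40,
    "NIC reads descriptor back over PCIe": 300,
    "doorbell MMIO write": 60,
    "DMA payload across PCIe, then start TX": 250,
    "SerDes / PHY puts bits on the wire": 80,
}

# ASSUMPTION: which stages each technique deletes. This mapping is a guess
# for illustration, not taken from the demo itself.
REMOVES = {
    "kernel_bypass": [
        "send() syscall + user/kernel mode switch",
        "kernel TCP/IP + sk_buff alloc",
        "softirq / scheduler wakeup",
    ],
    "zero_copy": ["copy payload into kernel buffer"],
    "tx_push": ["NIC reads descriptor back over PCIe"],
    # Simplification: cut-through overlaps DMA with transmit, modelled here
    # as removing the whole wait.
    "ctpio": ["DMA payload across PCIe, then start TX"],
}


def savings(technique):
    """Nanoseconds removed by one technique under the assumed mapping."""
    return sum(TX_STAGES[stage] for stage in REMOVES[technique])


def tx_latency(enabled):
    """Total latency with the given techniques switched on."""
    removed = set()
    for technique in enabled:
        removed.update(REMOVES[technique])
    return sum(ns for stage, ns in TX_STAGES.items() if stage not in removed)


for name in REMOVES:
    print(f"{name}: saves {savings(name)} ns")
print("everything on:", tx_latency(REMOVES), "ns")
```

Under these assumptions the three stages left standing are the descriptor write, the doorbell, and the PHY: the work the wire actually needs.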
Demo 2 · CTPIO cut-through, frame by frame
Why does cut-through matter so much? Press play. The top lane waits for the entire frame to cross PCIe before the first bit hits the wire; the bottom lane (CTPIO) starts transmitting as soon as bytes arrive, so the packet is leaving while it's still arriving.
Store-and-forward: wait for the last byte, then transmit
First bit on the wire at t = 15, only after the full frame has crossed PCIe.
CTPIO cut-through: start transmitting as bytes arrive
First bit on the wire at t = 2: the NIC streams onto the wire while the rest of the frame is still arriving.
Cut-through puts the first bit on the wire 13 ticks sooner (t = 2 instead of t = 15).
Legend: not yet arrived · arrived over PCIe · transmitting on the wire
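The two lanes can be reproduced with a toy timeline. This is a minimal sketch, assuming the frame is split into 15 equal chunks, one chunk crosses PCIe per tick, and the wire sends one chunk per tick; those numbers are chosen only to match the t = 15 and t = 2 figures above, and `tx_timeline` is a hypothetical helper, not a real NIC API.

```python
def tx_timeline(n_chunks, start_threshold, pcie_ticks=1, wire_ticks=1):
    """Return (first_bit_tick, last_chunk_sent_tick) for one frame.

    Chunk k (1-based) has fully crossed PCIe at tick k * pcie_ticks.
    The NIC starts transmitting once `start_threshold` chunks have arrived.
    """
    if not 1 <= start_threshold <= n_chunks:
        raise ValueError("start_threshold must be between 1 and n_chunks")
    arrival = [k * pcie_ticks for k in range(1, n_chunks + 1)]
    first_bit = arrival[start_threshold - 1]
    t = first_bit
    for arrived_at in arrival:
        # The wire cannot pause mid-frame, so this model only accepts
        # schedules where every chunk is present when the wire needs it.
        if arrived_at > t:
            raise ValueError("chunk needed before it arrived")
        t += wire_ticks
    return first_bit, t


FRAME_CHUNKS = 15  # assumed frame size in chunks
store_and_forward = tx_timeline(FRAME_CHUNKS, start_threshold=FRAME_CHUNKS)
cut_through = tx_timeline(FRAME_CHUNKS, start_threshold=2)
print("store-and-forward (first bit, done):", store_and_forward)
print("cut-through       (first bit, done):", cut_through)
```

In this model the whole frame, not just its first bit, leaves 13 ticks earlier, because the PCIe transfer and the wire transmission now overlap instead of running back to back.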
The catch a senior names unprompted: cut-through means you start sending before you've seen the whole frame, so if the DMA under-runs or the frame turns out bad, you've already committed bytes to the wire. The NIC has to handle that (e.g. deliberately corrupt the FCS so the receiver drops it). Latency wins always cost you something; know the price.
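Here is a rough model of that failure mode. It assumes a software stand-in for the NIC (`cut_through_send`, a hypothetical name) that streams chunks as they arrive and, on an under-run, ends the frame with an inverted checksum. `zlib.crc32` plays the role of the Ethernet FCS; real hardware does this in silicon, and the exact poisoning scheme varies by device.

```python
import struct
import zlib


def fcs(data):
    """4-byte CRC-32 trailer, standing in for the Ethernet FCS."""
    return struct.pack("<I", zlib.crc32(data) & 0xFFFFFFFF)


def cut_through_send(chunks, arrival_ticks, start_tick, wire_ticks=1):
    """Stream chunks onto the 'wire'; return (wire_bytes, underrun)."""
    sent = bytearray()
    t = start_tick  # tick at which the wire needs the next chunk
    for chunk, arrived_at in zip(chunks, arrival_ticks):
        if arrived_at > t:
            # Under-run: the wire needs bytes that have not crossed PCIe.
            # The frame cannot be paused, so end it with a deliberately
            # wrong checksum and let the receiver discard it.
            poisoned = bytes(b ^ 0xFF for b in fcs(bytes(sent)))
            return bytes(sent) + poisoned, True
        sent += chunk
        t += wire_ticks
    return bytes(sent) + fcs(bytes(sent)), False


def receiver_accepts(wire_bytes):
    """Receiver-side check: recompute the checksum over the body."""
    body, trailer = wire_bytes[:-4], wire_bytes[-4:]
    return fcs(body) == trailer


chunks = [b"abcd", b"efgh", b"ijkl"]
good, underrun = cut_through_send(chunks, arrival_ticks=[1, 2, 3], start_tick=2)
print("steady DMA:", underrun, receiver_accepts(good))
bad, underrun = cut_through_send(chunks, arrival_ticks=[1, 2, 9], start_tick=2)
print("stalled DMA:", underrun, receiver_accepts(bad))
```

The stalled case still burned wire time on a frame nobody will accept, which is the price the paragraph above is pointing at.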
Demo 3 · The receive path: interrupt vs. busy-poll
Receiving has its own tax: the interrupt. An interrupt means the CPU has to be told a packet arrived, take the IRQ, schedule softirq/NAPI, and wake your sleeping thread. If a core is already spinning on the ring, all of that vanishes: you trade a burned CPU core for latency.
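What "spinning on the ring" means can be sketched with a toy receive ring. Everything here is hypothetical (`RxRing`, `RxDescriptor`, `nic_deliver`): the "NIC" is just a method call, and a real ring lives in DMA-able memory with memory barriers this sketch ignores. The point is the shape of the loop: the application checks a done flag in the next descriptor, with no syscall and no sleep.

```python
from dataclasses import dataclass


@dataclass
class RxDescriptor:
    done: bool = False   # set by the "NIC" once the packet is in memory
    payload: bytes = b""


class RxRing:
    def __init__(self, size):
        self.slots = [RxDescriptor() for _ in range(size)]
        self.nic_head = 0  # next slot the NIC will fill
        self.app_tail = 0  # next slot the application will inspect

    def nic_deliver(self, payload):
        """Stand-in for the NIC: write a packet and mark its slot done."""
        slot = self.slots[self.nic_head]
        if slot.done:
            return False  # ring full: the app has not consumed this slot
        slot.payload = payload
        slot.done = True
        self.nic_head = (self.nic_head + 1) % len(self.slots)
        return True

    def poll(self):
        """One spin-loop iteration: look at the next descriptor, nothing else."""
        slot = self.slots[self.app_tail]
        if not slot.done:
            return None
        payload = slot.payload
        slot.done = False  # hand the slot back to the NIC
        self.app_tail = (self.app_tail + 1) % len(self.slots)
        return payload


def busy_poll(ring, max_spins):
    """Spin until a packet shows up or the spin budget runs out."""
    for spins in range(1, max_spins + 1):
        packet = ring.poll()
        if packet is not None:
            return packet, spins
    return None, max_spins


ring = RxRing(size=4)
ring.nic_deliver(b"market data tick")
print(busy_poll(ring, max_spins=1000))
```

Every empty `poll()` is a wasted iteration on a core that could be doing something else; that is the cost side of the trade.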
Baseline total: 890 ns
NIC DMAs the packet into host memory: 200 ns
raise MSI-X interrupt: 140 ns
IRQ entry + NAPI schedule: 170 ns
scheduler wakes the blocked app: 260 ns
deliver to app (copy / syscall return): 90 ns
app reads the descriptor from the ring: 30 ns
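The receive-side arithmetic, as a short sketch. The stage costs are the illustrative figures listed above; which stages busy-polling removes is an assumption of this sketch (the three interrupt-and-wakeup stages), and it leaves the delivery cost in place even though a full userspace stack might shave that too.

```python
# Stage costs copied from the Demo 3 baseline (illustrative nanoseconds).
RX_STAGES = [
    ("NIC DMAs the packet into host memory", 200),
    ("raise MSI-X interrupt", 140),
    ("IRQ entry + NAPI schedule", 170),
    ("scheduler wakes the blocked app", 260),
    ("deliver to app (copy / syscall return)", 90),
    ("app reads the descriptor from the ring", 30),
]

# ASSUMPTION: stages that exist only because the app was asleep and had to
# be told about the packet. Busy-polling is modelled as deleting these.
INTERRUPT_ONLY = {
    "raise MSI-X interrupt",
    "IRQ entry + NAPI schedule",
    "scheduler wakes the blocked app",
}


def rx_latency(busy_poll):
    """Total receive latency with or without a core spinning on the ring."""
    return sum(
        ns for stage, ns in RX_STAGES
        if not (busy_poll and stage in INTERRUPT_ONLY)
    )


print("interrupt-driven:", rx_latency(busy_poll=False), "ns")
print("busy-poll:       ", rx_latency(busy_poll=True), "ns")
```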
The tradeoff to state out loud: busy-polling isn't free. It pins a core at 100% and burns power. You do it on the hot path where a microsecond is worth a core; you don't do it for a bulk file transfer. Senior answers always name what the optimization costs, not just what it saves.
The one-line framework
At every stage on a hot path, ask: "Is this work the result depends on, or is it generality/safety tax I can remove here?" Kernel bypass, TX_PUSH, CTPIO, busy-poll, zero-copy, cache-line padding: every one of them is an answer to that single question. If you can derive the technique from the question instead of reciting its name, you're interviewing like a senior.