Not loops to watch: demos you drive. Toggle, step, drag, and break each idea across the whole stack: the NIC datapath, the C memory model, CPU microarchitecture, and networking / AI-fabric concepts.
The NIC datapath
Drive a packet between the host and the wire.
Descriptor ring (RX/TX)
Post buffers, ring the doorbell, watch the NIC DMA and complete; toggle RX vs TX.
MMIO doorbell idle
HEAD 0: hardware-owned (NIC consumes) · TAIL 0: driver-owned (driver posts) · OWN0: handed to NIC · DD1: done, awaiting reap
Ring initialized: 8 descriptors, all FREE.
Barrier before the doorbell: the descriptor writes must be globally visible before the doorbell MMIO write, or the NIC can DMA a half-written descriptor. The driver issues a write barrier (wmb()), then the posted doorbell write. Completions are learned from the DMA'd OWN/DD bit in host memory, never by polling a NIC register.
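The ordering rule above can be sketched in portable C. This is a minimal illustration, not a real driver: the descriptor layout, the `tx_ring` struct and `post_tx` are hypothetical names, the doorbell is an ordinary variable standing in for the MMIO register, and `atomic_thread_fence(memory_order_release)` stands in for the kernel's `wmb()` (a real driver needs the platform's own barrier and MMIO accessors).

```c
#include <stdatomic.h>
#include <stdint.h>

#define RING_SIZE 8u

/* Hypothetical descriptor layout, for illustration only. */
struct tx_desc {
    uint64_t addr;   /* DMA address of the buffer */
    uint32_t len;    /* buffer length in bytes */
    uint32_t flags;  /* bit 0 = OWN (handed to the NIC) */
};

struct tx_ring {
    struct tx_desc desc[RING_SIZE];
    uint32_t tail;                /* driver-owned: next slot to post */
    volatile uint32_t *doorbell;  /* stands in for the MMIO tail register */
};

/* Fill one descriptor, fence, then ring the doorbell. */
static void post_tx(struct tx_ring *r, uint64_t addr, uint32_t len)
{
    struct tx_desc *d = &r->desc[r->tail % RING_SIZE];

    d->addr = addr;
    d->len = len;
    d->flags = 1u;  /* OWN: the descriptor now belongs to the NIC */

    /* Stand-in for wmb(): descriptor stores are ordered before the doorbell store. */
    atomic_thread_fence(memory_order_release);

    r->tail++;
    *r->doorbell = r->tail % RING_SIZE;  /* posted write: the CPU does not wait */
}
```

Without the fence, the compiler or CPU would be free to make the doorbell store visible before the descriptor fields, which is exactly the half-written-descriptor hazard described above.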
Send path latency
Toggle kernel-bypass / TX_PUSH / CTPIO and watch the nanoseconds drop.
1230 ns baseline · 1230 ns · toggle a technique to shave it
send() syscall + mode switch: 90 ns
kernel TCP/IP + sk_buff: 220 ns
copy into kernel buffer: 80 ns
softirq / scheduler wakeup: 110 ns
driver writes TX descriptor: 40 ns
NIC reads descriptor over PCIe: 300 ns
doorbell MMIO write: 60 ns
DMA across PCIe, start TX: 250 ns
SerDes / PHY onto the wire: 80 ns
Interrupt vs busy-poll (RX)
Toggle busy-poll to delete the IRQ → softirq → wakeup tax.
890 ns baseline · 890 ns · toggle a technique to shave it
NIC DMAs the packet into host memory: 200 ns
raise MSI-X interrupt: 140 ns
IRQ entry + NAPI schedule: 170 ns
scheduler wakes the app: 260 ns
deliver to app (copy / return): 90 ns
app reads the descriptor: 30 ns
MMIO write vs read
A posted write is fire-and-forget; a non-posted read stalls the core.
Posted writes (e.g. a doorbell): fire & forget
W0
W1
W2
W3
W4
issuing…
The CPU issues each write and immediately moves on; the write drains to the device asynchronously.
Non-posted reads (an MMIO register read): CPU stalls for the completion
stall…
R1
R2
R3
R4
0/5 done
Each read is a full PCIe round-trip; the core cannot retire it until the completion returns, so reads serialize and stall.
t = 0 · 5 writes finish ~6× sooner than 5 reads
The takeaway for the datapath: never poll a NIC register on the hot path. You ring a doorbell (a posted write that doesn't stall), and you learn about completions from DMA'd status / events in host memory, never from an MMIO read. One stray read can cost more than the whole rest of the send.
IOMMU / IOVA
DMA to an IOVA: translated and bounds-checked, or blocked.
device issues DMA to:
IOMMU page table, per-device (IOVA → PA)
0x1000 → 0x4000: RX ring buffer
0x2000 → 0x8000: TX ring buffer
other IOVAs → fault: no PTE, DMA blocked
click an IOVA: the IOMMU decides ALLOW or BLOCK
Why the IOMMU matters: it gives each device its own address space. Unmapped or out-of-range DMA is blocked, so a compromised NIC can't read or write arbitrary host memory. The same hardware is what lets you safely pass a device through to a VM or to userspace (VFIO). With it disabled (passthrough), a stray DMA to an IOVA like 0xf000 could land at 0x0100, in kernel text (host memory).
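The allow-or-block decision can be sketched as a table lookup. This is a simplified model using the two mappings from the demo; the 4 KiB page size, the flat array in place of a multi-level table, and the names `iommu_pte` and `iommu_translate` are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 0x1000u  /* assumed 4 KiB pages */

/* One mapping: a page-aligned IOVA and the physical page behind it. */
struct iommu_pte {
    uint64_t iova;
    uint64_t pa;
};

/* The two mappings shown in the demo's per-device table. */
static const struct iommu_pte table[] = {
    { 0x1000, 0x4000 },  /* RX ring buffer */
    { 0x2000, 0x8000 },  /* TX ring buffer */
};

/* Translate iova; returns false (DMA blocked) when no PTE covers it. */
static bool iommu_translate(uint64_t iova, uint64_t *pa_out)
{
    uint64_t page = iova & ~(uint64_t)(PAGE_SIZE - 1);
    uint64_t offset = iova & (PAGE_SIZE - 1);

    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (table[i].iova == page) {
            *pa_out = table[i].pa + offset;
            return true;
        }
    }
    return false;  /* fault: no PTE */
}
```

A DMA to 0x2010 translates to 0x8010, while a DMA to 0xf000 finds no entry and is refused.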
NAPI poll loop
Burst packets; one interrupt, budgeted batch drain, then re-arm.
RX ring
occupancy 0/16
NIC interrupt line
ARMED
no poll pending
budget: 4/poll
drained this poll: 0
polls: 0
hard IRQs (NAPI): 0
IRQs if 1-per-packet: 0
Idle. Burst some packets into the RX ring to begin.
N = 6
The point: NAPI has raised 0 hard IRQs for these 0 packets, because once the first interrupt masks the line, the kernel keeps polling in bounded 4-packet batches instead of taking an interrupt per frame. Legacy one-IRQ-per-packet would have taken 0: an interrupt storm under load, with context-switch overhead drowning real work.
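The counters in this demo follow from a small piece of bookkeeping, sketched below. It is a deliberately simplified model, not the kernel's NAPI implementation: it assumes the whole burst is already in the ring when the first interrupt fires, and `napi_stats` and `napi_drain` are hypothetical names.

```c
#include <assert.h>

struct napi_stats {
    unsigned hard_irqs;  /* interrupts actually taken */
    unsigned polls;      /* budgeted poll passes */
};

/* Drain a burst of `packets` already sitting in the RX ring. */
static struct napi_stats napi_drain(unsigned packets, unsigned budget)
{
    struct napi_stats s = { 0, 0 };
    assert(budget > 0);

    if (packets == 0)
        return s;         /* line stays armed, nothing fires */

    s.hard_irqs = 1;      /* the first packet raises the IRQ, which masks the line */
    while (packets > 0) {
        unsigned n = packets < budget ? packets : budget;
        packets -= n;     /* one poll reaps at most `budget` packets */
        s.polls++;
    }
    /* ring empty: the interrupt would be re-armed (unmasked) here */
    return s;
}
```

With the demo's N = 6 and a budget of 4, this gives one hard IRQ and two polls (4 packets, then 2), against six IRQs in the one-per-packet scheme.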
Cut-through (CTPIO)
Store-and-forward vs starting on the wire as bytes arrive.
Store-and-forward: wait for the last byte, then transmit
First bit on the wire at t = 15, only after the full frame has crossed PCIe.
CTPIO cut-through: start transmitting as bytes arrive
First bit on the wire at t = 2; the NIC streams onto the wire while the rest of the frame is still arriving.
t = 0 · cut-through gets the first byte onto the wire ~14× sooner
not yet arrived
arrived over PCIe
transmitting on the wire
RDMA queue pair
Post a WR, the NIC executes it zero-copy, a completion lands: no kernel.
USERSPACE: app virtual address space
registered buffer
4096 B · pinned · MR registered
NIC has the address translation cached.
Send Queue (SQ)
empty
Completion Queue (CQ)
empty
no kernel · no copy · no syscall on the data path
HARDWARE: RDMA NIC & the wire
DMA engine idle
wire quiet
Buffer registered & pinned. Post a Send WR to begin; no syscall involved.
SQ 0 · inflight 0 · CQ 0
Why it's fast: the SQ/CQ live in memory shared between the app and the NIC. Posting and reaping are plain memory writes/reads plus a doorbell. The kernel is bypassed entirely, so there is no system call and no scheduler on the data path. Because the buffer is pre-registered, the NIC DMAs it directly to the wire with zero copies. Contrast a sockets send, which traps into the kernel and copies the payload from user space into a kernel skb before the NIC ever sees it.
Concurrency & the memory model
Make lock-free code break, then fix it.
Memory reordering
Run the race in relaxed mode (stale read), then release/acquire.
Producer
data = 42;
ready = 1; // plain / relaxed
Consumer
while (!ready) // plain / relaxed
;
use(data);
Global order in which writes actually become visible:
Press run to watch the two threads race.
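The release/acquire version of the race above might look like the following C11 sketch. It assumes POSIX threads are available; `producer`, `consume` and `run_once` are illustrative names, not part of the demo.

```c
#include <pthread.h>
#include <stdatomic.h>

static int data;          /* plain payload */
static atomic_int ready;  /* publication flag */

static void *producer(void *arg)
{
    (void)arg;
    data = 42;
    /* release: the store to data cannot be reordered after this store */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

/* Spins until the flag is set, then returns the published value. */
static int consume(void)
{
    /* acquire: pairs with the release store in producer() */
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;
    return data;  /* guaranteed to observe 42 */
}

/* Runs one producer thread against the calling thread as consumer. */
static int run_once(void)
{
    pthread_t t;
    int seen;

    data = 0;
    atomic_store(&ready, 0);
    if (pthread_create(&t, NULL, producer, NULL) != 0)
        return -1;
    seen = consume();
    pthread_join(t, NULL);
    return seen;
}
```

With plain or relaxed accesses on `ready`, nothing orders the two stores, so the consumer could leave the loop and still read a stale `data`; the release/acquire pair is what rules that out.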
CAS retry loop
Inject an interfering write; watch the CAS fail and retry.
shared atomic
0
thread A registers
expected (read): -
desired (new): -
memory == expected?: -
retries
0
Step thread A through read → compute → CAS. Inject an interfering write to force a retry.
value 0 · expected - · retries 0
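The read, compute, CAS cycle that thread A steps through corresponds to the usual C11 retry loop. A minimal sketch, with `cas_add` as a hypothetical helper name:

```c
#include <stdatomic.h>

/* Atomically add `delta` to *shared; returns how many times the CAS had to retry. */
static int cas_add(atomic_int *shared, int delta)
{
    int retries = 0;
    int expected = atomic_load(shared);  /* read */

    for (;;) {
        int desired = expected + delta;  /* compute */
        /* On failure, `expected` is refreshed with the current value. */
        if (atomic_compare_exchange_weak(shared, &expected, desired))
            return retries;
        retries++;
    }
}
```

The weak form may also fail spuriously on some architectures, so the retry count is not a reliable measure of contention; the loop simply recomputes from whatever value it last observed.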
The ABA problem
Step the A→B→A interleaving; toggle the tagged-pointer fix.
head pointer
head = A
stack (top β bottom)
A → B
B → C
C → NULL
free list
none
1. T1 reads head → A, plans CAS(head, A, A.next = B). Then T1 STALLS.
2. T2 pops A (head → B), frees A.
3. T2 pops B (head → C), frees B.
4. T2 reuses freed A, pushes it back (head → A, A.next = C).
5. T1 wakes and runs CAS(head, A, B), using the saved 'next'.
phase 0/5 · raw CAS
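One way to realize the tagged-pointer fix is sketched below. To keep the head in a single 64-bit atomic, it assumes nodes live in an array and are named by index rather than by raw pointer; the names `next_of`, `pack`, `cas_head`, `pop` and `push` are illustrative.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NIL 0xFFFFFFFFu
enum { A, B, C };  /* node indices used in the demo */

/* Nodes live in an array; a "pointer" is an index, so index + tag fit in 64 bits. */
static uint32_t next_of[3];    /* next_of[i] = index of the node below i */
static _Atomic uint64_t head;  /* low 32 bits: top index, high 32 bits: tag */

static uint64_t pack(uint32_t idx, uint32_t tag) { return ((uint64_t)tag << 32) | idx; }
static uint32_t idx_of(uint64_t h) { return (uint32_t)h; }
static uint32_t tag_of(uint64_t h) { return (uint32_t)(h >> 32); }

/* Every successful head update bumps the tag, so a recycled index never compares equal. */
static bool cas_head(uint64_t *seen, uint32_t new_idx)
{
    return atomic_compare_exchange_strong(&head, seen, pack(new_idx, tag_of(*seen) + 1));
}

static uint32_t pop(void)
{
    uint64_t seen = atomic_load(&head);
    while (idx_of(seen) != NIL && !cas_head(&seen, next_of[idx_of(seen)]))
        ;  /* on failure `seen` holds the fresh head; retry with it */
    return idx_of(seen);
}

static void push(uint32_t idx)
{
    uint64_t seen = atomic_load(&head);
    do {
        next_of[idx] = idx_of(seen);
    } while (!cas_head(&seen, idx));
}
```

Replaying the five phases: T1 saves the head while it reads (A, tag 0); T2's two pops and one push leave the head at (A, tag 3); T1's stale compare then fails because the tags differ, even though the index is A again. A raw CAS on the index alone would succeed and set head to the freed node B.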
SPSC ring buffer
Enqueue/dequeue; head and tail chase each other around.
head (write → 0) · tail (read → 0) · occupied
Buffer initialized: capacity 8, empty.
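A single-producer, single-consumer ring of this shape can be sketched with C11 atomics as follows. It keeps the demo's naming (head is the write index, tail is the read index) and its capacity of 8; the struct and function names are illustrative, and the indices are free-running counters reduced modulo the capacity on use.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define CAP 8u  /* power of two, as in the demo */

struct spsc {
    int slot[CAP];
    _Atomic size_t head;  /* write index: only the producer stores it */
    _Atomic size_t tail;  /* read index: only the consumer stores it */
};

static bool spsc_enqueue(struct spsc *q, int v)
{
    size_t h = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t t = atomic_load_explicit(&q->tail, memory_order_acquire);

    if (h - t == CAP)
        return false;      /* full */
    q->slot[h % CAP] = v;  /* write the payload first */
    /* release: publishes the payload together with the new head */
    atomic_store_explicit(&q->head, h + 1, memory_order_release);
    return true;
}

static bool spsc_dequeue(struct spsc *q, int *out)
{
    size_t t = atomic_load_explicit(&q->tail, memory_order_relaxed);
    size_t h = atomic_load_explicit(&q->head, memory_order_acquire);

    if (h == t)
        return false;      /* empty */
    *out = q->slot[t % CAP];
    /* release: tells the producer the slot may be reused */
    atomic_store_explicit(&q->tail, t + 1, memory_order_release);
    return true;
}
```

Because each index has exactly one writer, no compare-and-swap is needed; the acquire loads pair with the other side's release stores.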
MPMC ring
Two producers CAS a shared head; the loser retries.
shared head (next free)
head = 0 → slot 0
P1
payload: 11
reserved slot: -
state: idle
P2
payload: 22
reserved slot: -
state: idle
[0] ·
[1] ·
[2] ·
[3] ·
[4] ·
[5] ·
[6] ·
[7] ·
Step to interleave P1/P2: each reserves head via CAS, then commits to its slot. Use "P2 steals head" to force P1's CAS to fail.
Contrast with SPSC. With a single producer nobody else moves head, so reservation needs no CAS: the producer writes the payload into the slot, then publishes the new head with a release store. The CAS loop exists only because MPMC has contending writers on the shared index.
head 0 · slots used 0/8 · next P1
RCU grace period
Free now (use-after-free) vs after the grace period.
head → X (node) → tail
reader 1 · pre-existing · holds → H
t = 0 · RCU · linked
False sharing
Two variables on one cache line ping-pong between cores.
one shared line · owner: A
a
b
core A ops
0
core B ops
0
0 ops · 0 cache-line transfers: the line ping-pongs on every write
t = 0/48
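The usual cure is to give each hot variable its own cache line. A minimal C11 sketch, assuming a 64-byte line (common on x86-64, not universal); the struct names are illustrative:

```c
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE 64  /* assumed line size */

/* a and b share one line: a write to either invalidates the other core's copy. */
struct shared_line {
    long a;
    long b;
};

/* Each counter starts its own line, so the two cores stop contending. */
struct padded {
    alignas(CACHE_LINE) long a;
    alignas(CACHE_LINE) long b;
};
```

The padded layout costs 128 bytes instead of 16, which is the trade: memory for the absence of coherence traffic between two logically independent counters.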
CPU & microarchitecture
The hardware reality your C runs on.
Cache hierarchy
Pick what's cached, then load β watch the cycles to the hit level.
L1
holds the line
~32 KB, per-core. The fast path.
present · +1 cyc
L2
holds the line
~256 KB-1 MB, per-core. A few cycles.
present · +4 cyc
L3
holds the line
Shared, MBs. Tens of cycles.
present · +12 cyc
DRAM
holds the line
Main memory. The cliff: ~100 cycles.
present · +100 cyc
0 cyc · ready
target hit: L1 @ 1 cyc
Cache locality
Sequential vs random over the same 64 elements.
Each row of 8 is one 64-byte cache line. red = miss (line fetched from RAM), teal = hit (already in cache).
0 cyc · 0 misses / 0 accesses
0/64 accessed
MESI coherence
Read/write on each core; watch the line's MESI state move.
Core A
I
Invalid
no valid data
Core B
I
Invalid
no valid data
main memory: up to date
A: I · B: I
Cache line starts Invalid in both cores; the copy lives only in memory.
MESI invariant: at most one core may hold the line M or E, and while it does, no other core holds a valid copy. A write needs exclusive ownership, so it invalidates every other copy first.
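The transitions this demo walks through can be written as a two-core state machine. This is a simplified sketch of MESI for a single line (no bus model, no silent upgrades spelled out); the `line` struct and function names are hypothetical.

```c
#include <stdbool.h>

enum mesi { MESI_I, MESI_S, MESI_E, MESI_M };

struct line {
    enum mesi state[2];   /* per-core state of one cache line */
    bool mem_up_to_date;  /* false while some core holds it Modified */
};

/* Core `c` reads the line. */
static void mesi_read(struct line *l, int c)
{
    int o = 1 - c;
    if (l->state[c] != MESI_I)
        return;                    /* hit: M, E and S can all be read */
    if (l->state[o] == MESI_M)
        l->mem_up_to_date = true;  /* owner writes the dirty line back */
    if (l->state[o] == MESI_I) {
        l->state[c] = MESI_E;      /* sole copy */
    } else {
        l->state[o] = MESI_S;      /* M/E/S holder downgrades */
        l->state[c] = MESI_S;
    }
}

/* Core `c` writes the line. */
static void mesi_write(struct line *l, int c)
{
    l->state[1 - c] = MESI_I;      /* invalidate the other copy first */
    l->state[c] = MESI_M;
    l->mem_up_to_date = false;     /* memory is now stale */
}
```

Starting from I/I, a read on core A gives E, a read on core B moves both to S, and a write on A yields M on A and I on B, which matches the invariant stated above.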
TLB & page walk
Access pages: HIT, or a MISS that walks the page table.
PGD
↓
PUD
↓
PMD
↓
PTE
…
click a page
TLB: 3 entries (LRU)
MRU: empty
empty
LRU: empty
hits 0 · misses 0 · hit-rate 0%
TLB: a hit translates in ~1 cycle; a miss costs a full 4-level walk (~100 cycles). Re-clicking a cached page is a hit; new pages evict the LRU entry.
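The hit, miss and eviction behaviour of a 3-entry LRU TLB can be modelled in a few lines. This is a toy model of the demo's widget, not of real TLB hardware; `tlb` and `tlb_access` are illustrative names.

```c
#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 3

struct tlb {
    uint64_t page[TLB_ENTRIES];  /* page[0] is MRU, page[count-1] is LRU */
    int count;
    unsigned hits, misses;
};

/* Look up `page`; returns true on a hit. Either way the page ends up MRU. */
static bool tlb_access(struct tlb *t, uint64_t page)
{
    int pos = -1;
    for (int i = 0; i < t->count; i++)
        if (t->page[i] == page)
            pos = i;

    bool hit = pos >= 0;
    if (hit) {
        t->hits++;
    } else {
        t->misses++;                /* would trigger the 4-level walk */
        if (t->count < TLB_ENTRIES)
            t->count++;
        pos = t->count - 1;         /* new slot, or the LRU victim */
    }
    for (int i = pos; i > 0; i--)   /* shift down, then install as MRU */
        t->page[i] = t->page[i - 1];
    t->page[0] = page;
    return hit;
}
```

Accessing pages 1, 2, 3, then 1 again gives three misses and a hit; a fourth page then evicts page 2, the least recently used entry.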
Branch prediction
Drag predictability down and watch mispredicts tank throughput.
predictability 50% · data: noisy data
predicted · mispredict (~15-cycle flush) · showing first 64 of 4,096 branches
flush penalty = 15 cycles
The branch is free, until it's wrong. A correctly predicted branch retires at ~1/cycle. A mispredict flushes the whole speculative pipeline (~15 cycles of work thrown away). On a sorted array the branch is taken in long runs, so the predictor nails it; on a shuffled array it's a coin flip and throughput collapses: same code, same data, radically different speed. This is why hot loops get branchless rewrites (cmov, masks) or sort their input first.
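As an illustration of the mask technique mentioned above, here is the same conditional sum written with and without a data-dependent branch. The function names and the threshold parameter are made up for the example, and an optimizing compiler may already turn the branchy form into cmov or vector code on its own.

```c
#include <stddef.h>
#include <stdint.h>

/* Branchy: the `if` is a coin flip on shuffled data. */
static int64_t sum_ge_branchy(const int32_t *v, size_t n, int32_t threshold)
{
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        if (v[i] >= threshold)
            sum += v[i];
    return sum;
}

/* Branchless: turn the comparison into an all-ones or all-zeros mask. */
static int64_t sum_ge_branchless(const int32_t *v, size_t n, int32_t threshold)
{
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        int64_t mask = -(int64_t)(v[i] >= threshold);  /* -1 or 0 */
        sum += v[i] & mask;
    }
    return sum;
}
```

Both return the same result; the second does a fixed amount of work per element regardless of how predictable the comparison is.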
NUMA local vs remote
Access local vs remote memory across the interconnect.
NUMA node 0
CPU 0 · worker
local RAM · 80 ns
interconnect · +60 ns
NUMA node 1
CPU 1 · (idle)
remote RAM · 80 ns
80 ns · LOCAL access, best case
0 accesses · avg 80 ns
All accesses stayed local. Now hit remote a few times and watch the average climb; that gap is why thread + memory placement matters on a multi-socket box.
Batching
Amortize a fixed per-op cost across a batch; throughput rises.
batch size B = 1 · 4,096 doorbells for 4,096 items
1 doorbell (300 units) shared by 1 item
total time
1,269,760
throughput
0.003
overhead share
97%
5% of plateau throughput
Amortize the fixed cost. Every operation pays a fixed ~300-unit tax (syscall, MMIO doorbell write, IRQ) plus ~10/item of real work. At B=1 the overhead dominates (97% of total). Grow the batch and one doorbell covers the whole burst: throughput climbs fast, then plateaus once per-item cost dominates. This is NIC TX batching: fill a ring of descriptors, ring the doorbell once per burst instead of once per packet.
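The numbers on this panel follow from a two-term cost model, sketched here with the demo's constants (300 units per doorbell, 10 per item). The function names are illustrative and `batch` is assumed to be non-zero.

```c
#define FIXED_COST 300.0  /* per-doorbell overhead, in the demo's units */
#define ITEM_COST   10.0  /* per-item work */

/* Total time to push `items` through in batches of `batch`. */
static double total_time(unsigned items, unsigned batch)
{
    unsigned doorbells = (items + batch - 1) / batch;  /* ceil(items / batch) */
    return doorbells * FIXED_COST + items * ITEM_COST;
}

/* Fraction of the total spent on the fixed per-doorbell cost. */
static double overhead_share(unsigned items, unsigned batch)
{
    unsigned doorbells = (items + batch - 1) / batch;
    return doorbells * FIXED_COST / total_time(items, batch);
}
```

For 4,096 items at B = 1 this gives 4,096 × 310 = 1,269,760 units with roughly 97% overhead; at B = 64 the total drops to 64 × 300 + 40,960 = 60,160 units.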
Networking, TCP & AI fabric
From the handshake to GPU collectives.
TCP encapsulation
Step a segment through TCP → IP → Ethernet → the wire.
Application
TCP
IP
Link (driver)
NIC / DMA
Wire
send()
payload: your data
wire →
1/6 · send(fd, buf, len) hands your bytes to the kernel: just the payload, in a socket buffer.
Ethernet header + FCS: added by the link layer / NIC
your payload: never copied again under zero-copy / kernel bypass
TCP 3-way handshake
Step SYN → SYN-ACK → ACK; the state machine on each side.
CLIENT
CLOSED
press step / play to send the first segment
SERVER
LISTEN
0/4 · Server is LISTENing (passive open); client is about to actively open. No segments on the wire yet.
client: CLOSED · server: LISTEN
Why three messages? Each side must prove it can both send and receive, so each side's SYN needs its own ACK. That would be four segments, but the server's ACK and SYN ride one segment (SYN-ACK), giving three. The ack number is always the next sequence byte expected (x+1, y+1), not the last one received. The connection isn't usable until the final ACK lands: a full RTT of setup latency before byte one.
TCP congestion (AIMD)
The window ramps and halves on loss: the sawtooth.
cwnd = 1 · red dot = packet loss, window halved
The shape is the point. TCP doesn't blast at line rate; it probes: exponential slow start, then linear additive increase, until a loss signals congestion and it multiplicatively decreases (halves). That sawtooth is AIMD. It also means a bulk flow keeps the bottleneck queue filled, so a single low-latency request sharing that path waits behind it, which is why low-latency stacks pace, use small buffers, or bypass TCP entirely (UDP / RDMA) on the hot path.
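The sawtooth can be reproduced with a per-RTT update rule. This is a textbook simplification counted in whole segments, not the behaviour of any particular TCP stack (no fast recovery, no timeout reset to one segment); `aimd_step` is a hypothetical name.

```c
/* One RTT of a simplified AIMD window, in segments. */
static unsigned aimd_step(unsigned cwnd, unsigned *ssthresh, int loss)
{
    if (loss) {
        unsigned half = cwnd / 2;
        *ssthresh = half < 1 ? 1 : half;  /* multiplicative decrease */
        return *ssthresh;
    }
    if (cwnd < *ssthresh) {
        unsigned next = cwnd * 2;         /* slow start: double per RTT */
        return next > *ssthresh ? *ssthresh : next;
    }
    return cwnd + 1;                      /* congestion avoidance: +1 per RTT */
}
```

Starting at cwnd = 1 with a threshold of 16, the window goes 1, 2, 4, 8, 16, 17, and a loss at 17 cuts it to 8, after which it climbs by one per RTT again.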
TCP incast
Drag the sender count past the buffer and watch drops spike.
TCP incast. N senders synchronise a burst at one egress port. While N ≤ 4 the port drains as fast as packets arrive and the buffer stays shallow. Push N past what the buffer can absorb in one RTT and the tail overflows: packets are dropped, each dropped flow stalls a full RTO before retrying, and aggregate goodput collapses even though the link is barely used. Drag N up to find the cliff; a bigger buffer pushes the cliff right but adds queueing latency.
All-reduce: ring vs tree
Slide the GPU count; compare 2(N−1) vs 2·log₂N steps.
Ring
2(N−1) = 14
Tree
2⌈log₂N⌉ = 6
hop 0/14
Why the algorithm isn't "a detail." Ring all-reduce is bandwidth-optimal (each GPU sends ~2× its data, total independent of N) but takes 2(N−1) latency hops; tree / halving-doubling takes only 2⌈log₂N⌉. At N = 1024 that's 2046 hops vs 20. On a GPU cluster the collective is on the critical path: one slow link or a bad algorithm choice stalls every GPU, which is exactly why AI fabrics obsess over tail latency and in-order delivery.
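The two step counts are easy to compute directly. A small sketch of the formulas above, with illustrative function names; `ceil_log2` is valid for 1 ≤ n ≤ 2³¹.

```c
/* ceil(log2(n)) for n >= 1, using integer arithmetic only. */
static unsigned ceil_log2(unsigned n)
{
    unsigned bits = 0;
    while ((1u << bits) < n)
        bits++;
    return bits;
}

static unsigned ring_steps(unsigned n) { return 2 * (n - 1); }
static unsigned tree_steps(unsigned n) { return 2 * ceil_log2(n); }
```

For N = 8 this reproduces the panel's 14 vs 6, and for N = 1024 the 2046 vs 20 quoted above.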