27 interactive demos · drive every concept

Every concept, hands-on

Not loops to watch, but demos you drive. Toggle, step, drag, and break each idea across the whole stack: the NIC datapath, the C memory model, CPU microarchitecture, and networking / AI-fabric concepts.

The NIC datapath

Drive a packet between the host and the wire.

Descriptor ring (RX/TX)

Post buffers, ring the doorbell, watch the NIC DMA and complete; toggle RX vs TX.

MMIO doorbell idle
RX ring · d0 d1 d2 d3 d4 d5 d6 d7 · HEAD 0 · TAIL 0
HEAD 0: hardware-owned (NIC consumes) · TAIL 0: driver-owned (driver posts) · OWN 0: handed to NIC · DD 1: done, awaiting reap
Ring initialized: 8 descriptors, all FREE.
Barrier before the doorbell: the descriptor writes must be globally visible before the doorbell MMIO write, or the NIC can DMA a half-written descriptor. The driver issues a write barrier (wmb()), then the posted doorbell write. Completions are learned from the DMA'd OWN/DD bit in host memory, never by polling a NIC register.
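The ordering rule above can be sketched in user-space C. This is a simulation under stated assumptions, not driver code: the descriptor layout, `FLAG_DD`, `ring_post`, and `ring_done` are hypothetical names, the doorbell is an ordinary variable standing in for an MMIO register, and `atomic_thread_fence(memory_order_release)` stands in for the kernel's `wmb()`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define RING_SIZE 8u   /* matches the 8-descriptor ring in the demo */
#define FLAG_DD   1u   /* hypothetical "descriptor done" bit, set by the NIC */

struct desc {
    uint64_t buf_addr; /* DMA address of the buffer (hypothetical layout) */
    uint32_t len;
    uint32_t flags;
};

struct ring {
    struct desc d[RING_SIZE];
    uint32_t tail;               /* driver-owned: next slot to post */
    volatile uint32_t *doorbell; /* stand-in for the MMIO tail register */
};

/* Post one buffer: fill the descriptor, fence, then ring the doorbell. */
void ring_post(struct ring *r, uint64_t addr, uint32_t len)
{
    struct desc *d = &r->d[r->tail % RING_SIZE];
    d->buf_addr = addr;
    d->len = len;
    d->flags = 0;
    r->tail++;
    /* Stand-in for wmb(): descriptor stores must be visible first. */
    atomic_thread_fence(memory_order_release);
    *r->doorbell = r->tail % RING_SIZE; /* the posted doorbell write */
}

/* Reap check: read the DD bit from host memory, never from a NIC register. */
int ring_done(const struct ring *r, uint32_t idx)
{
    uint32_t flags = *(const volatile uint32_t *)&r->d[idx % RING_SIZE].flags;
    atomic_thread_fence(memory_order_acquire);
    return (flags & FLAG_DD) != 0;
}
```

In a real Linux driver the fence would typically be `wmb()` or `dma_wmb()` followed by a `writel()` to the device; the sketch only shows the order of operations.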

Send path latency

Toggle kernel-bypass / TX_PUSH / CTPIO and watch the nanoseconds drop.

1230 ns · baseline 1230 ns · toggle a technique to shave it
send() syscall + mode switch: 90 ns
kernel TCP/IP + sk_buff: 220 ns
copy into kernel buffer: 80 ns
softirq / scheduler wakeup: 110 ns
driver writes TX descriptor: 40 ns
NIC reads descriptor over PCIe: 300 ns
doorbell MMIO write: 60 ns
DMA across PCIe, start TX: 250 ns
SerDes / PHY onto the wire: 80 ns

Interrupt vs busy-poll (RX)

Toggle busy-poll to delete the IRQ → softirq → wakeup tax.

890 ns · baseline 890 ns · toggle a technique to shave it
NIC DMAs the packet into host memory: 200 ns
raise MSI-X interrupt: 140 ns
IRQ entry + NAPI schedule: 170 ns
scheduler wakes the app: 260 ns
deliver to app (copy / return): 90 ns
app reads the descriptor: 30 ns

MMIO write vs read

A posted write is fire-and-forget; a non-posted read stalls the core.

Posted writes (e.g. a doorbell): fire & forget
W0
W1
W2
W3
W4
issuing…
The CPU issues each write and immediately moves on; the write drains to the device asynchronously.
Non-posted reads (an MMIO register read): the CPU stalls for the completion
stall…
R1
R2
R3
R4
0/5 done
Each read is a full PCIe round-trip; the core cannot retire it until the completion returns, so reads serialize and stall.
t = 0 · 5 writes finish ~6× sooner than 5 reads
The takeaway for the datapath: never poll a NIC register on the hot path. You ring a doorbell (a posted write that doesn't stall), and you learn about completions from DMA'd status / events in host memory, never from an MMIO read. One stray read can cost more than the whole rest of the send.

IOMMU / IOVA

DMA to an IOVA: translated and bounds-checked, or blocked.

device issues DMA to:
IOMMU page table, per-device (IOVA → PA)
0x1000 → 0x4000 · RX ring buffer
0x2000 → 0x8000 · TX ring buffer
other IOVAs → fault · no PTE, DMA blocked
click an IOVA: the IOMMU decides ALLOW or BLOCK
Why the IOMMU matters: it gives each device its own address space. Unmapped or out-of-range DMA is blocked, so a compromised NIC can't read or write arbitrary host memory. The same hardware is what lets you safely pass a device through to a VM or to userspace (VFIO). With translation disabled (passthrough), nothing checks the address, so a stray IOVA like 0xf000 could land in 0x0100 (kernel text in host memory).
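The ALLOW / BLOCK decision can be sketched as a table lookup. This is a toy model of the demo's two mappings only; the 4 KiB page size and the names `iommu_pte` and `iommu_translate` are assumptions for illustration, and a real IOMMU walks multi-level page tables in hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IOMMU_PAGE_SIZE 0x1000u /* assumed 4 KiB pages */

struct iommu_pte {
    uint64_t iova; /* page-aligned device-visible address */
    uint64_t pa;   /* page-aligned physical address */
};

/* Per-device table mirroring the demo: two mapped pages, all else faults. */
static const struct iommu_pte table[] = {
    { 0x1000, 0x4000 }, /* RX ring buffer */
    { 0x2000, 0x8000 }, /* TX ring buffer */
};

/* Returns 1 and writes *pa on ALLOW; returns 0 on BLOCK (no PTE). */
int iommu_translate(uint64_t iova, uint64_t *pa)
{
    uint64_t page = iova & ~(uint64_t)(IOMMU_PAGE_SIZE - 1);
    uint64_t off  = iova &  (uint64_t)(IOMMU_PAGE_SIZE - 1);

    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (table[i].iova == page) {
            *pa = table[i].pa + off;
            return 1;
        }
    }
    return 0; /* unmapped IOVA: the DMA is blocked */
}
```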

NAPI poll loop

Burst packets; one interrupt, budgeted batch drain, then re-arm.

RX ring
occupancy 0/16
NIC interrupt line
ARMED
no poll pending
budget: 4/poll
drained this poll: 0
polls: 0
hard IRQs (NAPI): 0
IRQs if 1-per-packet: 0
Idle. Burst some packets into the RX ring to begin.
N = 6
The point: NAPI has raised 0 hard IRQs for these 0 packets, because once the first interrupt masks the line the kernel keeps polling in bounded 4-packet batches instead of taking an interrupt per frame. Legacy one-IRQ-per-packet would have taken 0: an interrupt storm under load, with context-switch overhead drowning real work.
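The mask, poll, re-arm cycle can be sketched as a small simulation. The struct and function names here are hypothetical, and the model simplifies the kernel's NAPI (one queue, no softirq scheduling); it only counts interrupts and polls for a burst with the demo's budget of 4.

```c
#include <assert.h>

#define NAPI_BUDGET 4 /* packets drained per poll, as in the demo */

struct napi_sim {
    int occupancy; /* packets waiting in the RX ring */
    int armed;     /* 1 = interrupt line unmasked */
    int hard_irqs; /* interrupts actually taken */
    int polls;     /* poll iterations run */
};

/* NIC side: packets land; an IRQ fires only if the line is armed. */
void napi_rx_burst(struct napi_sim *s, int n)
{
    s->occupancy += n;
    if (n > 0 && s->armed) {
        s->hard_irqs++;
        s->armed = 0; /* handler masks the line and schedules polling */
    }
}

/* One poll: drain up to the budget; re-arm only if under budget. */
int napi_poll(struct napi_sim *s)
{
    int drained = s->occupancy < NAPI_BUDGET ? s->occupancy : NAPI_BUDGET;
    s->occupancy -= drained;
    s->polls++;
    if (drained < NAPI_BUDGET)
        s->armed = 1; /* ring is empty: unmask the interrupt */
    return drained;
}

/* Keep polling until the line has been re-armed. */
void napi_run(struct napi_sim *s)
{
    while (!s->armed)
        napi_poll(s);
}
```

With a burst of 6 packets this takes one interrupt and two polls (4 packets, then 2), where one-IRQ-per-packet would take six interrupts.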

Cut-through (CTPIO)

Store-and-forward vs starting on the wire as bytes arrive.

Store-and-forward: wait for the last byte, then transmit
First bit on the wire at t = 15, only after the full frame has crossed PCIe.
CTPIO cut-through: start transmitting as bytes arrive
First bit on the wire at t = 2; the NIC streams onto the wire while the rest of the frame is still arriving.
t = 0 · cut-through finishes ~14× sooner to first byte
not yet arrived
arrived over PCIe
transmitting on the wire

RDMA queue pair

Post a WR, the NIC executes it zero-copy, a completion lands, and the kernel is never involved.

USERSPACE: app virtual address space
registered buffer
4096 B · pinned · MR registered
NIC has the address translation cached.
Send Queue (SQ)
empty
Completion Queue (CQ)
empty
no kernel · no copy · no syscall on the data path
HARDWARE: RDMA NIC & the wire
DMA engine idle
wire quiet
Buffer registered & pinned. Post a Send WR to begin; no syscall involved.
SQ 0 · inflight 0 · CQ 0
Why it's fast: the SQ/CQ live in memory shared between the app and the NIC. Posting and reaping are plain memory writes/reads plus a doorbell. The kernel is bypassed entirely, so there is no system call and no scheduler on the data path. Because the buffer is pre-registered, the NIC DMAs it directly to the wire with zero copies. Contrast a sockets send, which traps into the kernel and copies the payload from user memory into a kernel skb before the NIC ever sees it.

Concurrency & the memory model

Make lock-free code break, then fix it.

Memory reordering

Run the race in relaxed mode (stale read), then release/acquire.

Producer
data = 42;
ready = 1;   // plain / relaxed
Consumer
while (!ready)   // plain / relaxed
   ;
use(data);
Global order in which writes actually become visible:
Press run to watch the two threads race.
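The release/acquire fix for the race above can be sketched with C11 atomics. The function names are hypothetical, and `consumer_try` is a single non-blocking check rather than the demo's spin loop, so it can be called from one thread for illustration; a real consumer would call it in a loop on another thread.

```c
#include <assert.h>
#include <stdatomic.h>

static int data;         /* plain payload, published via `ready` */
static atomic_int ready; /* publication flag, starts at 0 */

/* Producer: write the payload, then publish with a release store. */
void producer(void)
{
    data = 42;
    atomic_store_explicit(&ready, 1, memory_order_release);
}

/* Consumer: acquire-load the flag; if it is set, the payload written
 * before the release store is guaranteed to be visible. */
int consumer_try(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return 0; /* not published yet */
    *out = data;
    return 1;
}
```

The release store keeps `data = 42` from being reordered after the flag write, and the acquire load keeps the read of `data` from being hoisted above the flag check.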

CAS retry loop

Inject an interfering write; watch the CAS fail and retry.

shared atomic
0
thread A registers
expected (read): (none)
desired (new): (none)
memory == expected?: (none)
retries
0
Step thread A through read → compute → CAS. Inject an interfering write to force a retry.
value 0 · expected (none) · retries 0
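A minimal sketch of the read, compute, CAS cycle with C11 atomics follows. `cas_step` and `cas_add` are hypothetical helper names; `cas_step` is a single attempt so the failure case can be observed, and `cas_add` is the usual retry loop.

```c
#include <assert.h>
#include <stdatomic.h>

/* One attempt: install `desired` only if memory still equals *expected.
 * On failure, *expected is refreshed with the value actually found. */
int cas_step(atomic_int *shared, int *expected, int desired)
{
    return atomic_compare_exchange_strong(shared, expected, desired);
}

/* Retry loop: atomically add `delta`, returning how many retries it took.
 * The weak form may also fail spuriously, which simply costs a retry. */
int cas_add(atomic_int *shared, int delta)
{
    int retries = 0;
    int expected = atomic_load_explicit(shared, memory_order_relaxed);

    while (!atomic_compare_exchange_weak_explicit(
               shared, &expected, expected + delta,
               memory_order_acq_rel, memory_order_relaxed)) {
        retries++; /* expected now holds the fresh value; recompute */
    }
    return retries;
}
```

If another thread writes between the read and the CAS, `cas_step` returns 0 and hands back the new value in `*expected`, which is exactly the retry the demo forces.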

The ABA problem

Step the A→B→A interleaving; toggle the tagged-pointer fix.

head pointer
head = A
stack (top → bottom)
A → B
B → C
C → ∅
free list
∅ none
○ T1 reads head → A, plans CAS(head, A, A.next = B). Then T1 STALLS.
○ T2 pops A (head ← B), frees A.
○ T2 pops B (head ← C), frees B.
○ T2 reuses freed A, pushes it back (head ← A, A.next = C).
○ T1 wakes and runs CAS(head, A, B), using the saved 'next'.
phase 0/5 · raw CAS
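The tagged-pointer fix can be sketched by packing a version counter next to the head. This is one illustrative variant, not necessarily what the demo implements: nodes are indices into a static array (0 = A, 1 = B, 2 = C), the head is a 64-bit word holding `(tag << 32) | index`, and every successful push or pop bumps the tag. All names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define NIL 0xFFFFFFFFu /* "no node" index */

struct node { uint32_t next; };
static struct node nodes[3];  /* 0 = A, 1 = B, 2 = C */
static _Atomic uint64_t head; /* (tag << 32) | node index */

static uint64_t pack(uint32_t idx, uint32_t tag) { return ((uint64_t)tag << 32) | idx; }
static uint32_t idx_of(uint64_t h) { return (uint32_t)h; }
static uint32_t tag_of(uint64_t h) { return (uint32_t)(h >> 32); }

/* Build the demo's stack A -> B -> C with tag 0. */
void stack_init(void)
{
    nodes[0].next = 1;
    nodes[1].next = 2;
    nodes[2].next = NIL;
    atomic_store(&head, pack(0, 0));
}

void push(uint32_t n)
{
    uint64_t old = atomic_load(&head);
    do {
        nodes[n].next = idx_of(old);
    } while (!atomic_compare_exchange_weak(&head, &old,
                                           pack(n, tag_of(old) + 1)));
}

uint32_t pop(void)
{
    uint64_t old = atomic_load(&head);
    uint32_t n;
    do {
        n = idx_of(old);
        if (n == NIL)
            return NIL; /* empty stack */
    } while (!atomic_compare_exchange_weak(&head, &old,
                                           pack(nodes[n].next, tag_of(old) + 1)));
    return n;
}
```

Replaying the demo's interleaving: T1 saves the head word (A, tag 0), T2 pops A, pops B, and pushes A back. The head index is A again, but the tag is now 3, so T1's CAS against its saved word fails instead of installing the stale B.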

SPSC ring buffer

Enqueue/dequeue; head and tail chase each other around.

slots 0 1 2 3 4 5 6 7 · 0 / 8 · EMPTY
head (write → 0) · tail (read → 0) · occupied
Buffer initialized: capacity 8, empty.
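A minimal single-producer / single-consumer ring in C11 might look like the sketch below, using the demo's convention that head is the write index and tail is the read index. The names are hypothetical, and the indices are free-running counters reduced modulo the capacity, which assumes the capacity is a power of two.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define CAP 8u /* power of two, matches the demo's capacity */

struct spsc {
    int slots[CAP];
    _Atomic size_t head; /* write index: stored only by the producer */
    _Atomic size_t tail; /* read index: stored only by the consumer */
};

/* Producer: returns 0 if the ring is full. */
int spsc_enqueue(struct spsc *q, int v)
{
    size_t h = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t t = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (h - t == CAP)
        return 0;
    q->slots[h % CAP] = v;
    /* Publish: the slot write must be visible before the new head. */
    atomic_store_explicit(&q->head, h + 1, memory_order_release);
    return 1;
}

/* Consumer: returns 0 if the ring is empty. */
int spsc_dequeue(struct spsc *q, int *out)
{
    size_t t = atomic_load_explicit(&q->tail, memory_order_relaxed);
    size_t h = atomic_load_explicit(&q->head, memory_order_acquire);
    if (h == t)
        return 0;
    *out = q->slots[t % CAP];
    /* Release the slot back to the producer. */
    atomic_store_explicit(&q->tail, t + 1, memory_order_release);
    return 1;
}
```

Neither side needs a CAS because each index has exactly one writer; the release/acquire pairs are what order the slot contents against the index updates.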

MPMC ring

Two producers CAS a shared head; the loser retries.

shared head (next free)
head = 0 → slot 0
P1
payload: 11
reserved slot: (none)
state: idle
P2
payload: 22
reserved slot: (none)
state: idle
[0] ·
[1] ·
[2] ·
[3] ·
[4] ·
[5] ·
[6] ·
[7] ·
Step to interleave P1/P2: each reserves head via CAS, then commits to its slot. Use "P2 steals head" to force P1's CAS to fail.
Contrast with SPSC: with a single producer nobody else moves head, so reservation needs no CAS. The producer writes the payload into the slot, then publishes the incremented head with a release store, which the consumer pairs with an acquire load. The CAS loop exists only because MPMC has contending writers on the shared index.
head 0 · slots used 0/8 · next P1

RCU grace period

Free now (use-after-free) vs after the grace period.

head → X (node) → tail
reader 1 · pre-existing · holds → H
t = 0 · RCU · linked

False sharing

Two variables on one cache line ping-pong between cores.

one shared line · owner: A
a
b
core A ops
0
core B ops
0
0 ops · 0 cache-line transfers · the line ping-pongs every write
t = 0/48
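The usual fix for false sharing is to give each hot variable its own cache line. The sketch below assumes a 64-byte line (common, but not universal) and uses hypothetical struct names; `same_cache_line` compares member offsets relative to a line-aligned base.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE 64 /* assumed line size in bytes */

/* Both counters share one line: writes from two cores ping-pong it. */
struct shared_line {
    long a;
    long b;
};

/* Each counter is aligned to its own line: no false sharing. */
struct padded_pair {
    alignas(CACHE_LINE) long a;
    alignas(CACHE_LINE) long b;
};

/* Do two offsets within a line-aligned struct fall on the same line? */
int same_cache_line(size_t off_a, size_t off_b)
{
    return off_a / CACHE_LINE == off_b / CACHE_LINE;
}
```

The cost is memory: the padded struct occupies two full lines instead of 16 bytes, which is a reasonable trade only for variables that different cores write frequently.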

CPU & microarchitecture

The hardware reality your C runs on.

Cache hierarchy

Pick what's cached, then load; watch the cycles to the hit level.

L1
holds the line
~32 KB, per-core. The fast path.
present · +1 cyc
L2
holds the line
~256 KB to 1 MB, per-core. A few cycles.
present · +4 cyc
L3
holds the line
Shared, MBs. Tens of cycles.
present · +12 cyc
DRAM
holds the line
Main memory. The cliff: ~100 cycles.
present · +100 cyc
0 cyc · ready
target hit: L1 @ 1 cyc

Cache locality

Sequential vs random over the same 64 elements.

Each row of 8 is one 64-byte cache line. red = miss (line fetched from RAM), teal = hit (already in cache).
0 cyc · 0 misses / 0 accesses
0/64 accessed

MESI coherence

Read/write on each core; watch the line's MESI state move.

Core A
I
Invalid
no valid data
Core B
I
Invalid
no valid data
main memory: up to date
A: I · B: I
Cache line starts Invalid in both cores; the copy lives only in memory.
MESI invariant: at most one core may hold the line M or E. A write needs exclusive ownership, so it invalidates every other copy first.

TLB & page walk

Access pages: a HIT, or a MISS that walks the page table.

PGD → PUD → PMD → PTE
…
click a page
TLB: 3 entries (LRU)
MRU: (empty)
(empty)
LRU: (empty)
hits 0 · misses 0 · hit-rate 0%
TLB: a hit translates in ~1 cycle; a miss costs a full 4-level walk (~100 cycles). Re-clicking a cached page is a hit; new pages evict the LRU entry.
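The hit / miss / evict behaviour can be sketched as a tiny LRU array. This is a model of the demo's 3-entry TLB with its stated costs (about 1 cycle for a hit, about 100 for a walk), not of real hardware; the names are hypothetical.

```c
#include <assert.h>

#define TLB_ENTRIES 3
#define HIT_CYCLES  1   /* approximate cost of a TLB hit */
#define WALK_CYCLES 100 /* approximate cost of a 4-level page walk */

struct tlb {
    long page[TLB_ENTRIES]; /* page[0] is MRU, last used slot is LRU */
    int used;               /* number of valid entries */
    int hits, misses;
};

/* Access one virtual page number; returns the cycles it cost. */
int tlb_access(struct tlb *t, long vpn)
{
    int pos = -1;
    for (int i = 0; i < t->used; i++) {
        if (t->page[i] == vpn) {
            pos = i;
            break;
        }
    }

    int cycles;
    if (pos >= 0) {
        t->hits++;
        cycles = HIT_CYCLES;
    } else {
        t->misses++;
        cycles = WALK_CYCLES;
        if (t->used < TLB_ENTRIES)
            t->used++;
        pos = t->used - 1; /* a free slot, or the LRU slot to overwrite */
    }

    /* Move the page to the MRU position, shifting newer entries down. */
    for (int i = pos; i > 0; i--)
        t->page[i] = t->page[i - 1];
    t->page[0] = vpn;
    return cycles;
}
```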

Branch prediction

Drag predictability down and watch mispredicts tank throughput.

predictability 50% · data: noisy
■ predicted · ■ mispredict (≈15-cycle flush) · showing first 64 of 4,096 branches
flush penalty = 15 cycles
The branch is free until it's wrong. A correctly predicted branch retires at ~1/cycle. A mispredict flushes the whole speculative pipeline (~15 cycles of work thrown away). On a sorted array the branch is taken in long runs, so the predictor nails it; on a shuffled array it's a coin flip and throughput collapses: same code, same values, radically different speed. This is why hot loops get branchless rewrites (cmov, masks) or sort the data first.
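As an illustration of the mask technique, here is a conditional sum written both ways. The function names and the "sum elements at or above a threshold" task are assumptions chosen for the example; note that an optimizing compiler may already turn the branchy version into cmov or SIMD code, so measure before rewriting.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Branchy: one data-dependent branch per element. */
int64_t sum_ge_branchy(const int32_t *v, size_t n, int32_t threshold)
{
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] >= threshold)
            sum += v[i];
    }
    return sum;
}

/* Branchless: turn the comparison into an all-ones or all-zeros mask. */
int64_t sum_ge_branchless(const int32_t *v, size_t n, int32_t threshold)
{
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* (v[i] >= threshold) is 0 or 1; negating gives 0 or -1. */
        int64_t mask = -(int64_t)(v[i] >= threshold);
        sum += (int64_t)v[i] & mask;
    }
    return sum;
}
```

The branchless loop does the same work on every element regardless of the data, so its running time does not depend on how predictable the comparison is.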

NUMA local vs remote

Access local vs remote memory across the interconnect.

NUMA node 0
CPU 0 worker
local RAM · 80 ns
NUMA node 1
CPU 1 (idle)
remote RAM · 80 ns
80 ns · LOCAL access, best case
0 accesses · avg 80 ns
All accesses stayed local. Now hit remote a few times and watch the average climb; that gap is why thread + memory placement matters on a multi-socket box.

Batching

Amortize a fixed per-op cost across a batch; throughput rises.

throughput (items / unit) · batch size → 64
batch size B = 1 · 4,096 doorbells for 4,096 items
1 doorbell (300 units) shared by 1 item
total time: 1,269,760
throughput: 0.003
overhead share: 97%
5% of plateau throughput
Amortize the fixed cost. Every operation pays a fixed ~300-unit tax (syscall, MMIO doorbell write, IRQ) plus ~10/item of real work. At B=1 the overhead dominates (about 97% of total: 300 of every 310 units). Grow the batch and one doorbell covers the whole burst; throughput climbs fast, then plateaus once per-item cost dominates. This is NIC TX batching: fill a ring of descriptors, ring the doorbell once per burst instead of once per packet.
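The cost model behind those numbers is total time = doorbells × 300 + items × 10, with one doorbell per batch. A small sketch (function names are hypothetical) reproduces the readout for 4,096 items:

```c
#include <assert.h>

#define FIXED_COST 300 /* units per doorbell (syscall / MMIO / IRQ tax) */
#define PER_ITEM   10  /* units of real work per item */

/* Number of doorbells needed: one per batch, rounding up. */
long doorbells(long items, long batch)
{
    assert(batch > 0);
    return (items + batch - 1) / batch;
}

long total_time(long items, long batch)
{
    return doorbells(items, batch) * FIXED_COST + items * PER_ITEM;
}

double throughput(long items, long batch)
{
    return (double)items / (double)total_time(items, batch);
}

/* Fraction of total time spent on the fixed per-doorbell cost. */
double overhead_share(long items, long batch)
{
    return (double)(doorbells(items, batch) * FIXED_COST)
         / (double)total_time(items, batch);
}
```

For 4,096 items, B = 1 gives 4,096 × 310 = 1,269,760 units, while B = 64 gives 64 × 300 + 40,960 = 60,160 units, roughly a 21× improvement from batching alone.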

Networking, TCP & AI fabric

From the handshake to GPU collectives.

TCP encapsulation

Step a segment through TCP → IP → Ethernet → the wire.

Application
TCP
IP
Link (driver)
NIC / DMA
Wire
send()
payload · your data
wire →
1/6 · send(fd, buf, len) hands your bytes to the kernel: just the payload, in a socket buffer.
Ethernet header + FCS: added by the link layer / NIC
IP header: routing, TTL, header checksum
TCP header: ports, seq/ack, window, checksum (often NIC-offloaded)
your payload: never copied again under zero-copy / kernel bypass

TCP 3-way handshake

Step SYN → SYN-ACK → ACK; the state machine on each side.

CLIENT
CLOSED
press step / play to send the first segment
SERVER
LISTEN
0/4 · Server is LISTENing (passive open); client is about to actively open. No segments on the wire yet.
client: CLOSED · server: LISTEN
Why three messages? Each side must prove it can both send and receive, so each side's SYN needs its own ACK. That's four exchanges, but the server's ACK and SYN ride one segment (SYN-ACK), giving three. The ack number is always the next sequence byte expected (x+1, y+1), not what was received. The connection isn't usable until the final ACK lands: a full 1 RTT of setup latency before byte one.

TCP congestion (AIMD)

The window ramps and halves on loss: the sawtooth.

link capacity → loss · cwnd (segments in flight)
cwnd = 1 · red dot = packet loss → window halved
The shape is the point. TCP doesn't blast at line rate; it probes: exponential slow start, then linear additive increase, until a loss signals congestion and it multiplicatively decreases (halves). That sawtooth is AIMD. It also means a single low-latency request riding a fat bulk flow inherits that queue, which is why low-latency stacks pace, use small buffers, or bypass TCP entirely (UDP / RDMA) on the hot path.
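The sawtooth can be reproduced with a few lines of per-RTT logic. This is a simplified sketch, not a real TCP implementation: the `aimd` struct, the `capacity` threshold that triggers loss, and the initial `ssthresh` are all assumptions chosen for illustration, and details such as fast retransmit and timeouts are ignored.

```c
#include <assert.h>

struct aimd {
    int cwnd;     /* congestion window, in segments */
    int ssthresh; /* slow-start threshold, in segments */
};

/* Advance one RTT. Returns 1 if this RTT saw a loss, else 0.
 * A loss is modelled as cwnd exceeding the link capacity. */
int aimd_step(struct aimd *s, int capacity)
{
    if (s->cwnd > capacity) {
        /* Multiplicative decrease: halve the window (minimum 1). */
        s->ssthresh = s->cwnd / 2 > 1 ? s->cwnd / 2 : 1;
        s->cwnd = s->ssthresh;
        return 1;
    }
    if (s->cwnd < s->ssthresh)
        s->cwnd *= 2; /* slow start: exponential growth */
    else
        s->cwnd += 1; /* congestion avoidance: additive increase */
    return 0;
}
```

Starting from cwnd = 1 with ssthresh = 16 and a capacity of 20, the window doubles to 16, creeps up to 21, then drops to 10 and starts climbing linearly again.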

TCP incast

Drag the sender count past the buffer and watch drops spike.

S1
S2
S3
S4
S5
S6
S7
S8
switch egress port
0/12
drops: 0
delivered: 0/48
goodput: 0.00/tick · 0%
tick 0 · N×1 = 8 pkts/tick offered vs 4 drained · buffer absorbing
TCP incast. N senders synchronise a burst at one egress port. While N ≤ 4 the port drains as fast as packets arrive and the buffer stays shallow. Push N past what the buffer can absorb in one RTT and the tail overflows: packets are dropped, each dropped flow stalls a full RTO before retrying, and aggregate goodput collapses even though the link is barely used. Drag N up to find the cliff; a bigger buffer pushes the cliff right but adds queueing latency.
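The buffer overflow itself can be sketched as a per-tick queue model using the demo's numbers (one packet per sender per tick, 4 drained per tick, a 12-packet buffer). The names are hypothetical, arrivals are assumed to be enqueued before the port drains, and the RTO stall that follows a drop is not modelled.

```c
#include <assert.h>

struct incast {
    int queue;     /* packets sitting in the egress buffer */
    int drops;     /* packets tail-dropped so far */
    int delivered; /* packets drained onto the link so far */
};

/* One tick: every sender offers one packet, then the port drains. */
void incast_tick(struct incast *s, int senders, int buffer, int drain)
{
    /* Arrivals: whatever does not fit in the buffer is dropped. */
    int room = buffer - s->queue;
    int accepted = senders < room ? senders : room;
    s->drops += senders - accepted;
    s->queue += accepted;

    /* Egress: drain up to `drain` packets this tick. */
    int out = s->queue < drain ? s->queue : drain;
    s->queue -= out;
    s->delivered += out;
}
```

With 4 senders the queue returns to zero every tick. With 8 senders the buffer absorbs the excess for two ticks, then every later tick drops 4 packets.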

All-reduce: ring vs tree

Slide the GPU count; compare 2(N−1) vs 2⌈log₂N⌉ steps.

G0 G1 G2 G3 G4 G5 G6 G7
Ring
2(N−1) = 14
Tree
2⌈log₂N⌉ = 6
hop 0/14
Why the algorithm isn't "a detail": ring all-reduce is bandwidth-optimal (each GPU sends ~2× its data, total independent of N) but takes 2(N−1) latency hops; tree / halving-doubling takes only 2⌈log₂N⌉. At N = 1024 that's 2046 hops vs 20. On a GPU cluster the collective is on the critical path: one slow link or a bad algorithm choice stalls every GPU, which is exactly why AI fabrics obsess over tail latency and in-order delivery.
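The two step counts are easy to compute directly. The function names below are hypothetical; the sketch assumes 1 ≤ N ≤ 2³¹ so the shift in `ceil_log2` stays in range.

```c
#include <assert.h>

/* Ring all-reduce: N-1 reduce-scatter steps plus N-1 all-gather steps. */
unsigned ring_steps(unsigned n)
{
    assert(n >= 1);
    return 2u * (n - 1u);
}

/* Smallest k such that 2^k >= n (assumes 1 <= n <= 2^31). */
unsigned ceil_log2(unsigned n)
{
    unsigned k = 0;
    while ((1u << k) < n)
        k++;
    return k;
}

/* Tree / halving-doubling: a reduce phase plus a broadcast phase. */
unsigned tree_steps(unsigned n)
{
    assert(n >= 1);
    return 2u * ceil_log2(n);
}
```

For N = 8 this gives 14 vs 6, and for N = 1024 it gives 2046 vs 20, matching the figures above.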