27 interactive demos · drive every concept

Every concept, hands-on

Not loops to watch — demos you drive. Toggle, step, drag, and break each idea across the whole stack: the NIC datapath, the C memory model, CPU microarchitecture, and networking / AI-fabric concepts.

The NIC datapath

Drive a packet between the host and the wire.

Descriptor ring (RX/TX)

Post buffers, ring the doorbell, watch the NIC DMA and complete — toggle RX vs TX.

MMIO doorbell idle

HEAD 0: hardware-owned (NIC consumes)

TAIL 0: driver-owned (driver posts)

OWN 0: handed to NIC

DD 1: done, awaiting reap

Ring initialized — 8 descriptors, all FREE.

Barrier before the doorbell: the descriptor writes must be globally visible before the doorbell MMIO write, or the NIC can DMA a half-written descriptor. The driver issues a write barrier (wmb()), then the posted doorbell write. Completions are learned from the DMA'd OWN/DD bit in host memory — never by polling a NIC register.
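The ordering rule above can be sketched in C. This is a minimal userspace model, not a real driver: the descriptor layout, `ring_post`, and `ring_done` are hypothetical names, the doorbell is an ordinary variable standing in for an MMIO register, and the C11 release fence only marks where a kernel driver would issue `wmb()` (a real device needs the platform's own DMA/MMIO barrier). Ring-full checks are omitted for brevity.

```c
#include <stdatomic.h>
#include <stdint.h>

#define RING_SIZE 8u   /* matches the 8-descriptor ring in the demo */

/* Hypothetical descriptor layout; every real NIC defines its own. */
struct tx_desc {
    uint64_t addr;   /* DMA address of the packet buffer */
    uint16_t len;    /* payload length in bytes */
    uint8_t  own;    /* 1 = handed to the NIC */
    uint8_t  dd;     /* 1 = done; the NIC sets this by DMA into host memory */
};

struct tx_ring {
    struct tx_desc desc[RING_SIZE];
    uint32_t tail;                 /* driver-owned: next slot to post */
    volatile uint32_t *doorbell;   /* stand-in for the MMIO tail register */
};

/* Post one buffer: fill the descriptor, fence, then ring the doorbell. */
static inline uint32_t ring_post(struct tx_ring *r, uint64_t dma_addr, uint16_t len)
{
    struct tx_desc *d = &r->desc[r->tail % RING_SIZE];
    d->addr = dma_addr;
    d->len  = len;
    d->dd   = 0;
    d->own  = 1;

    /* Placeholder for wmb(): descriptor stores must be visible first. */
    atomic_thread_fence(memory_order_release);

    r->tail++;
    *r->doorbell = r->tail % RING_SIZE;   /* the posted doorbell write */
    return r->tail;
}

/* Reap by reading the DD bit in host memory, never an MMIO register. */
static inline int ring_done(const struct tx_ring *r, uint32_t idx)
{
    return r->desc[idx % RING_SIZE].dd != 0;
}
```

Calling `ring_post(&ring, 0x4000, 64)` on a zeroed ring fills slot 0 and writes 1 to the doorbell; `ring_done(&ring, 0)` stays false until something (in hardware, the NIC's DMA) sets the DD bit.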

Send path latency

Toggle kernel-bypass / TX_PUSH / CTPIO and watch the nanoseconds drop.

1230 ns

baseline · 1230 ns · toggle a technique to shave it

send() syscall + mode switch: 90 ns

kernel TCP/IP + sk_buff: 220 ns

copy into kernel buffer: 80 ns

softirq / scheduler wakeup: 110 ns

driver writes TX descriptor: 40 ns

NIC reads descriptor over PCIe: 300 ns

doorbell MMIO write: 60 ns

DMA across PCIe, start TX: 250 ns

SerDes / PHY onto the wire: 80 ns

Interrupt vs busy-poll (RX)

Toggle busy-poll to delete the IRQ → softirq → wakeup tax.

890 ns

baseline · 890 ns · toggle a technique to shave it

NIC DMAs the packet into host memory: 200 ns

raise MSI-X interrupt: 140 ns

IRQ entry + NAPI schedule: 170 ns

scheduler wakes the app: 260 ns

deliver to app (copy / return): 90 ns

app reads the descriptor: 30 ns

MMIO write vs read

A posted write is fire-and-forget; a non-posted read stalls the core.

Posted writes (e.g. a doorbell): fire & forget

issuing…

The CPU issues each write and immediately moves on — the write drains to the device asynchronously.

Non-posted reads (an MMIO register read) — CPU stalls for the completion

stall…

0/5 done

Each read is a full PCIe round-trip; the core cannot retire it until the completion returns — so reads serialize and stall.

t = 0 · 5 writes finish ~6× sooner than 5 reads

The takeaway for the datapath: never poll a NIC register on the hot path. You ring a doorbell (a posted write that doesn't stall), and you learn about completions from DMA'd status / events in host memory, never from an MMIO read. One stray read can cost more than the whole rest of the send.

IOMMU / IOVA

DMA to an IOVA — translated and bounds-checked, or blocked.

device issues DMA to:

IOMMU page table — per-device (IOVA → PA)

0x1000 → 0x4000 · RX ring buffer

0x2000 → 0x8000 · TX ring buffer

other IOVAs → fault · no PTE, DMA blocked

click an IOVA — the IOMMU decides ALLOW or BLOCK

Why the IOMMU matters: it gives each device its own address space. Unmapped or out-of-range DMA is blocked, so a compromised NIC can't read or write arbitrary host memory. The same hardware lets you safely pass a device through to a VM or to userspace (VFIO). With translation disabled (passthrough), a stray IOVA like 0xf000 could land at 0x0100, which is kernel text in host memory.
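The allow-or-block decision can be sketched as a table lookup. This is a toy model using the two mappings shown in the demo; the 4 KiB page size, the flat array, and the names `demo_table` and `iommu_translate` are illustrative assumptions (a real IOMMU walks multi-level page tables in hardware).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 0x1000u   /* assumed 4 KiB pages */

struct iommu_pte { uint64_t iova; uint64_t pa; };

/* Per-device table from the demo: two mapped pages, nothing else. */
static const struct iommu_pte demo_table[] = {
    { 0x1000, 0x4000 },   /* RX ring buffer */
    { 0x2000, 0x8000 },   /* TX ring buffer */
};

/* Returns true and fills *pa_out if the IOVA is mapped; false means fault. */
static bool iommu_translate(uint64_t iova, uint64_t *pa_out)
{
    uint64_t page = iova & ~(uint64_t)(PAGE_SIZE - 1);
    uint64_t off  = iova &  (uint64_t)(PAGE_SIZE - 1);

    for (size_t i = 0; i < sizeof demo_table / sizeof demo_table[0]; i++) {
        if (demo_table[i].iova == page) {
            *pa_out = demo_table[i].pa + off;   /* ALLOW: translated */
            return true;
        }
    }
    return false;                               /* BLOCK: no PTE */
}
```

With this table, a DMA to 0x1010 translates to 0x4010, while a DMA to 0xf000 finds no entry and is blocked.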

NAPI poll loop

Burst packets; one interrupt, budgeted batch drain, then re-arm.

RX ring

occupancy 0/16

NIC interrupt line

ARMED

no poll pending

budget: 4/poll

drained this poll: 0

polls: 0

hard IRQs (NAPI): 0

IRQs if 1-per-packet: 0

Idle. Burst some packets into the RX ring to begin.

N = 6

The point: NAPI has raised 0 hard IRQs for these 0 packets, because once the first interrupt masks the line the kernel keeps polling in bounded 4-packet batches instead of taking an interrupt per frame. Legacy one-IRQ-per-packet would have taken 0 — an interrupt storm under load, with context-switch overhead drowning real work.
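The interrupt-then-poll pattern can be modelled in a few lines. This is a simplified simulation, not kernel code: `struct napi_sim`, `napi_rx_burst`, and `napi_poll` are hypothetical names, and the rule "re-arm when a poll drains fewer packets than the budget" is a simplified reading of how NAPI drivers typically finish polling.

```c
#include <stdbool.h>

struct napi_sim {
    int  ring;        /* packets waiting in the RX ring */
    int  budget;      /* max packets drained per poll */
    bool irq_armed;   /* is the interrupt line enabled? */
    int  hard_irqs;   /* hard interrupts actually taken */
    int  polls;       /* number of poll passes */
    int  delivered;   /* packets handed up the stack */
};

/* NIC side: packets arrive; only an armed line raises a hard IRQ. */
static void napi_rx_burst(struct napi_sim *s, int n)
{
    s->ring += n;
    if (n > 0 && s->irq_armed) {
        s->irq_armed = false;   /* handler masks the line and schedules a poll */
        s->hard_irqs++;
    }
}

/* Softirq side: drain up to budget. Returns true if another poll is needed. */
static bool napi_poll(struct napi_sim *s)
{
    int n = s->ring < s->budget ? s->ring : s->budget;
    s->ring      -= n;
    s->delivered += n;
    s->polls++;

    if (n < s->budget) {        /* ring ran dry: re-arm and stop polling */
        s->irq_armed = true;
        return false;
    }
    return true;                /* used the full budget: poll again */
}
```

Bursting 6 packets with a budget of 4 takes one hard IRQ and two polls (4 packets, then 2), after which the line is armed again.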

Cut-through (CTPIO)

Store-and-forward vs starting on the wire as bytes arrive.

Store-and-forward — wait for the last byte, then transmit

First bit on the wire at t = 15 — only after the full frame has crossed PCIe.

CTPIO cut-through — start transmitting as bytes arrive

First bit on the wire at t = 2 — the NIC streams onto the wire while the rest of the frame is still arriving.

t = 0 · cut-through puts the first bit on the wire ~7.5× sooner (t = 2 vs t = 15)

not yet arrived

arrived over PCIe

transmitting on the wire

RDMA queue pair

Post a WR, the NIC executes it zero-copy, a completion lands — no kernel.

USERSPACE — app virtual address space

registered buffer

4096 B · pinned · MR registered

NIC has the address translation cached.

Send Queue (SQ)

empty

Completion Queue (CQ)

empty

no kernel · no copy · no syscall on the data path

HARDWARE — RDMA NIC & the wire

DMA engine idle

wire quiet

Buffer registered & pinned. Post a Send WR to begin — no syscall involved.

SQ 0 · inflight 0 · CQ 0

Why it's fast: the SQ/CQ live in memory shared between the app and the NIC. Posting and reaping are plain memory writes/reads plus a doorbell. The kernel is bypassed entirely, so there is no system call and no scheduler on the data path. Because the buffer is pre-registered, the NIC DMAs it directly to the wire with zero copies. Contrast a sockets send, which traps into the kernel and copies the payload from user memory into a kernel skb before the NIC ever sees it.

Concurrency & the memory model

Make lock-free code break, then fix it.

Memory reordering

Run the race in relaxed mode (stale read), then release/acquire.

Producer

data = 42;
ready = 1;   // plain / relaxed

Consumer

while (!ready)   // plain / relaxed
   ;
use(data);

Global order in which writes actually become visible:

Press run to watch the two threads race.
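The release/acquire version of the same handshake might look like the sketch below, using C11 `<stdatomic.h>`. The function names `producer_publish` and `consumer_wait` are illustrative; only the two memory orders matter.

```c
#include <stdatomic.h>

static int        data;    /* plain payload, written before the flag */
static atomic_int ready;   /* publication flag, zero-initialized */

/* Producer: write the payload, then publish it with a release store. */
static void producer_publish(int value)
{
    data = value;
    atomic_store_explicit(&ready, 1, memory_order_release);
}

/* Consumer: the acquire load pairs with the release store, so once it
 * observes ready == 1 the earlier write to data is visible as well. */
static int consumer_wait(void)
{
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;   /* spin until published */
    return data;
}
```

In a real program `producer_publish(42)` and `consumer_wait()` would run on different threads; with the release/acquire pair the consumer can no longer see `ready == 1` alongside a stale `data`.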

CAS retry loop

Inject an interfering write; watch the CAS fail and retry.

shared atomic

thread A registers

expected (read)—

desired (new)—

memory == expected?—

retries

Step thread A through read → compute → CAS. Inject an interfering write to force a retry.

value 0 · expected — · retries 0
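A minimal sketch of the read → compute → CAS loop with C11 atomics. `cas_add` is a hypothetical helper; the retry counter exists only to mirror the demo's display and may also tick on spurious failures, which the weak CAS is allowed to produce.

```c
#include <stdatomic.h>

/* Add delta to *shared with a CAS loop. Returns the value this thread
 * installed; *retries counts the failed attempts along the way. */
static int cas_add(atomic_int *shared, int delta, int *retries)
{
    int expected = atomic_load_explicit(shared, memory_order_relaxed);
    *retries = 0;

    for (;;) {
        int desired = expected + delta;   /* compute from the value we read */
        if (atomic_compare_exchange_weak_explicit(
                shared, &expected, desired,
                memory_order_acq_rel, memory_order_relaxed))
            return desired;               /* memory still held expected */
        /* CAS failed: expected now holds the current value, so recompute. */
        (*retries)++;
    }
}
```

An interfering write between the load and the CAS changes memory, so the comparison fails, `expected` is refreshed with the interferer's value, and the loop tries again from there.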

The ABA problem

Step the A→B→A interleaving; toggle the tagged-pointer fix.

head pointer

head = A

stack (top → bottom)

A→ B

B→ C

C→ ∅

free list

∅ none

○ T1 reads head → A, plans CAS(head, A, A.next = B). Then T1 STALLS.

○ T2 pops A (head ← B), frees A.

○ T2 pops B (head ← C), frees B.

○ T2 reuses freed A, pushes it back (head ← A, A.next = C).

○ T1 wakes and runs CAS(head, A, B) — the saved 'next'.

phase 0/5 · raw CAS
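The five steps above can be replayed deterministically on one thread to show why a version tag helps. In this sketch nodes are array indices rather than pointers, and the head packs a 32-bit tag with a 32-bit index into one 64-bit word; `pack`, `head_cas`, and `replay_aba` are illustrative names, and the layout is an assumption rather than the demo's actual implementation.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

enum { NODE_A = 0, NODE_B = 1, NODE_C = 2, NODE_NIL = 0xFFFF };

/* Pack a version tag and a node index into one CAS-able word. */
static uint64_t pack(uint32_t tag, uint32_t idx) { return ((uint64_t)tag << 32) | idx; }
static uint32_t idx_of(uint64_t h) { return (uint32_t)h; }
static uint32_t tag_of(uint64_t h) { return (uint32_t)(h >> 32); }

/* Swing head from `seen` to new_idx. With use_tag, each success bumps the tag. */
static bool head_cas(_Atomic uint64_t *head, uint64_t seen, uint32_t new_idx, bool use_tag)
{
    uint32_t new_tag = use_tag ? tag_of(seen) + 1 : 0;
    return atomic_compare_exchange_strong(head, &seen, pack(new_tag, new_idx));
}

/* Replays the demo's interleaving. Returns true if T1's stale CAS succeeds. */
static bool replay_aba(bool use_tag)
{
    _Atomic uint64_t head;
    atomic_init(&head, pack(0, NODE_A));
    uint32_t next[3] = { NODE_B, NODE_C, NODE_NIL };    /* A -> B -> C */

    uint64_t t1_seen = atomic_load(&head);              /* T1 reads head = A */
    uint32_t t1_next = next[idx_of(t1_seen)];           /* saves A.next = B, stalls */

    head_cas(&head, atomic_load(&head), NODE_B, use_tag);   /* T2 pops A */
    head_cas(&head, atomic_load(&head), NODE_C, use_tag);   /* T2 pops B */
    next[NODE_A] = NODE_C;                                  /* T2 reuses A */
    head_cas(&head, atomic_load(&head), NODE_A, use_tag);   /* T2 pushes A back */

    return head_cas(&head, t1_seen, t1_next, use_tag);      /* T1 wakes */
}
```

With the raw CAS, head again reads as A, so T1's CAS succeeds and installs the freed node B. With the tag, head is (tag 3, A) while T1 saw (tag 0, A), so the CAS fails and T1 must re-read.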

SPSC ring buffer

Enqueue/dequeue; head and tail chase each other around.

head (write → 0) tail (read → 0) occupied

Buffer initialized — capacity 8, empty.
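A minimal single-producer/single-consumer ring in C11, following the demo's naming (head is the write index, tail is the read index) and its capacity of 8. The struct and function names are illustrative. Indices grow without wrapping and are reduced modulo the capacity only when indexing a slot.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define SPSC_CAP 8u   /* capacity from the demo; a power of two */

struct spsc {
    int           slot[SPSC_CAP];
    atomic_size_t head;   /* write index: stored only by the producer */
    atomic_size_t tail;   /* read index: stored only by the consumer */
};

static bool spsc_enqueue(struct spsc *q, int v)
{
    size_t h = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t t = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (h - t == SPSC_CAP)
        return false;                       /* full */
    q->slot[h % SPSC_CAP] = v;              /* plain payload store */
    atomic_store_explicit(&q->head, h + 1, memory_order_release);  /* publish */
    return true;
}

static bool spsc_dequeue(struct spsc *q, int *out)
{
    size_t t = atomic_load_explicit(&q->tail, memory_order_relaxed);
    size_t h = atomic_load_explicit(&q->head, memory_order_acquire);
    if (h == t)
        return false;                       /* empty */
    *out = q->slot[t % SPSC_CAP];
    atomic_store_explicit(&q->tail, t + 1, memory_order_release);  /* free slot */
    return true;
}
```

Each index has exactly one writer, so no CAS is needed: the release store on one side pairs with the acquire load on the other.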

MPMC ring

Two producers CAS a shared head; the loser retries.

shared head (next free)

head = 0 → slot 0

payload: 11

reserved slot: (none)

state: idle

payload: 22

reserved slot: (none)

state: idle

[0]·

[1]·

[2]·

[3]·

[4]·

[5]·

[6]·

[7]·

Step to interleave P1/P2: each reserves head via CAS, then commits to its slot. Use “P2 steals head” to force P1's CAS to fail.

Contrast: SPSC. With a single producer nobody else moves head, so reservation needs no CAS: the producer writes the payload with a plain store, then publishes it by advancing head with a release store. The CAS loop exists only because MPMC has contending writers on the shared index.

head 0 · slots used 0/8 · next P1

RCU grace period

Free now (use-after-free) vs after the grace period.

head→X (node)→tail

reader 1 · pre-existing · holds → H

t = 0 · RCU · linked

False sharing

Two variables on one cache line ping-pong between cores.

one shared line — owner: A

core A ops

core B ops

0 ops · 0 cache-line transfers: the line ping-pongs every write

t = 0/48
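The usual fix is to give each hot variable its own cache line. A minimal layout sketch, assuming a 64-byte line (typical on x86-64, but check the target); the struct names are illustrative.

```c
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE 64   /* assumed line size in bytes */

/* Both counters share one line: writes from two cores ping-pong it. */
struct counters_shared {
    long a;   /* written by core A */
    long b;   /* written by core B */
};

/* Each counter is aligned to its own line: no false sharing. */
struct counters_padded {
    alignas(CACHE_LINE) long a;
    alignas(CACHE_LINE) long b;
};
```

The padded struct costs 128 bytes instead of 16, which is the trade: memory for the absence of coherence traffic between the two writers.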

CPU & microarchitecture

The hardware reality your C runs on.

Cache hierarchy

Pick what's cached, then load — watch the cycles to the hit level.

L1

holds the line

~32 KB, per-core. The fast path.

present · +1 cyc

L2

holds the line

~256 KB to 1 MB, per-core. A few cycles.

present · +4 cyc

L3

holds the line

Shared, MBs. Tens of cycles.

present · +12 cyc

DRAM

holds the line

Main memory. The cliff: ~100 cycles.

present+100 cyc

0 cyc · ready

target hit: L1 @ 1 cyc

Cache locality

Sequential vs random over the same 64 elements.

Each row of 8 is one 64-byte cache line. red = miss (line fetched from RAM), teal = hit (already in cache).

0 cyc · 0 misses / 0 accesses

0/64 accessed

MESI coherence

Read/write on each core; watch the line's MESI state move.

Core A

Invalid

no valid data

Core B

Invalid

no valid data

main memory · up to date

A: I · B: I

Cache line starts Invalid in both cores; the copy lives only in memory.

MESI invariant: at most one core may hold the line M or E. A write needs exclusive ownership, so it invalidates every other copy first.
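The two-core state machine can be sketched as a pair of transition functions. This is a simplified model: it tracks only the per-core state of one line, ignores bus messages and write-back timing, and uses illustrative names (`mesi_read`, `mesi_write`).

```c
enum mesi { MESI_I, MESI_S, MESI_E, MESI_M };

struct mesi_line { enum mesi core[2]; };   /* state of one line in each core */

/* Core c reads the line. */
static void mesi_read(struct mesi_line *l, int c)
{
    int other = 1 - c;
    if (l->core[c] != MESI_I)
        return;                        /* already valid: a hit, no change */
    if (l->core[other] == MESI_I) {
        l->core[c] = MESI_E;           /* sole copy: Exclusive */
    } else {
        l->core[other] = MESI_S;       /* other copy (M/E/S) drops to Shared */
        l->core[c]     = MESI_S;
    }
}

/* Core c writes the line: invalidate the other copy, then own it Modified. */
static void mesi_write(struct mesi_line *l, int c)
{
    l->core[1 - c] = MESI_I;
    l->core[c]     = MESI_M;
}
```

Starting from I/I: A reads (E/I), B reads (S/S), A writes (M/I), B reads (S/S), B writes (I/M). At no point do both cores hold M or E.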

TLB & page walk

Access pages — HIT, or a MISS that walks the page table.

PGD

→

PUD

→

PMD

→

PTE

…

click a page

TLB — 3 entries (LRU)

MRU: empty

empty

LRU: empty

hits 0 · misses 0 · hit-rate 0%

TLB: a hit translates in ~1 cycle; a miss costs a full 4-level walk (~100 cycles). Re-clicking a cached page is a hit; new pages evict the LRU entry.
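A 3-entry LRU TLB like the one in the demo can be modelled as a small move-to-front array. The cycle costs are the round numbers quoted above; the struct and function names are illustrative, and real TLBs are set-associative hardware structures with approximate LRU.

```c
#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 3
#define HIT_CYCLES  1     /* from the text: a hit translates in ~1 cycle */
#define WALK_CYCLES 100   /* from the text: a 4-level walk costs ~100 cycles */

struct tlb {
    uint64_t page[TLB_ENTRIES];   /* page[0] is MRU, last used entry is LRU */
    int      used;
    unsigned hits, misses;
};

/* Translate one virtual page number; returns the cycle cost. */
static int tlb_access(struct tlb *t, uint64_t vpn)
{
    int pos = -1;
    for (int i = 0; i < t->used; i++)
        if (t->page[i] == vpn) { pos = i; break; }

    bool hit = pos >= 0;
    if (!hit) {
        if (t->used < TLB_ENTRIES)
            t->used++;
        pos = t->used - 1;        /* a fresh slot, or the LRU entry to evict */
    }
    for (int i = pos; i > 0; i--) /* slide more-recent entries down one place */
        t->page[i] = t->page[i - 1];
    t->page[0] = vpn;             /* the accessed page becomes MRU */

    if (hit) { t->hits++; return HIT_CYCLES; }
    t->misses++;
    return WALK_CYCLES;
}
```

Touching pages 1, 2, 3 gives three misses; touching 1 again is a hit; touching 4 then evicts page 2, the least recently used entry.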

Branch prediction

Drag predictability down and watch mispredicts tank throughput.

predictability 50% · data: noisy data

■ predicted ■ mispredict (≈15-cycle flush) · showing first 64 of 4,096 branches

flush penalty = 15 cycles

The branch is free until it's wrong. A correctly predicted branch retires at ~1/cycle. A mispredict flushes the whole speculative pipeline (~15 cycles of work thrown away). On a sorted array the branch is taken in long runs, so the predictor nails it; on a shuffled array it's a coin flip and throughput collapses: same code, same values, radically different speed. This is why hot loops get branchless rewrites (cmov, masks) or sort their input first.
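A small illustration of a branchless rewrite. Both functions count the elements at or above a threshold; the names are illustrative. Whether the first version actually compiles to a branch depends on the compiler and flags, since optimizers often perform this transformation themselves, so treat it as a sketch of the idea rather than a guaranteed speedup.

```c
#include <stddef.h>

/* Branchy: the if is data-dependent, so noisy data means mispredicts. */
static size_t count_ge_branchy(const int *v, size_t n, int threshold)
{
    size_t c = 0;
    for (size_t i = 0; i < n; i++)
        if (v[i] >= threshold)
            c++;
    return c;
}

/* Branchless: the comparison yields 0 or 1 and is added unconditionally. */
static size_t count_ge_branchless(const int *v, size_t n, int threshold)
{
    size_t c = 0;
    for (size_t i = 0; i < n; i++)
        c += (size_t)(v[i] >= threshold);
    return c;
}
```

The two return the same count for any input; only the control flow inside the loop differs.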

NUMA local vs remote

Access local vs remote memory across the interconnect.

NUMA node 0

CPU 0 worker

local RAM · 80 ns

interconnect

+60 ns

NUMA node 1

CPU 1 (idle)

remote RAM · 80 ns

80 ns · LOCAL access, best case

0 accesses · avg 80 ns

All accesses stayed local. Now hit remote a few times and watch the average climb — that gap is why thread + memory placement matters on a multi-socket box.

Batching

Amortize a fixed per-op cost across a batch — throughput rises.

batch size B = 1

4,096 doorbells for 4,096 items

🔔

1 doorbell (300 units) shared by 1 item

total time

1,269,760

throughput

0.003

overhead share

97%

5% of plateau throughput

Amortize the fixed cost. Every operation pays a fixed ~300-unit tax (syscall, MMIO doorbell write, IRQ) plus ~10/item of real work. At B=1 the overhead dominates (97% of total). Grow the batch and one doorbell covers the whole burst: throughput climbs fast, then plateaus once per-item cost dominates. This is NIC TX batching: fill a ring of descriptors, ring the doorbell once per burst instead of once per packet.
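The arithmetic behind those numbers, as a sketch using the demo's constants (300 units per doorbell, 10 units per item). The function names are illustrative.

```c
#include <stdint.h>

#define FIXED_COST 300   /* per doorbell: syscall / MMIO write / IRQ */
#define ITEM_COST  10    /* per item of real work */

/* Total time to process `items` in batches of `batch` (batch >= 1). */
static uint64_t batch_total_time(uint64_t items, uint64_t batch)
{
    uint64_t doorbells = (items + batch - 1) / batch;   /* round up */
    return doorbells * FIXED_COST + items * ITEM_COST;
}

/* Fraction of the total spent on fixed overhead rather than real work. */
static double batch_overhead_share(uint64_t items, uint64_t batch)
{
    uint64_t doorbells = (items + batch - 1) / batch;
    return (double)(doorbells * FIXED_COST)
         / (double)batch_total_time(items, batch);
}
```

For 4,096 items at B = 1 this gives 4,096 × 310 = 1,269,760 units with about 96.8% overhead; at B = 64 the total drops to 64 × 300 + 40,960 = 60,160.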

Networking, TCP & AI fabric

From the handshake to GPU collectives.

TCP encapsulation

Step a segment through TCP → IP → Ethernet → the wire.

Application

TCP

Link (driver)

NIC / DMA

Wire

send()

payload: your data

wire →

1/6 · send(fd, buf, len) hands your bytes to the kernel: just the payload, in a socket buffer.

Ethernet header + FCS — added by the link layer / NIC

IP header — routing, TTL, header checksum

TCP header — ports, seq/ack, window, checksum (often NIC-offloaded)

your payload — never copied again under zero-copy / kernel bypass

TCP 3-way handshake

Step SYN → SYN-ACK → ACK; the state machine on each side.

CLIENT

CLOSED

press step / play to send the first segment

SERVER

LISTEN

0/4 · Server is LISTENing (passive open); client is about to actively open. No segments on the wire yet.

client: CLOSED · server: LISTEN

Why three messages? Each side must prove it can both send and receive, so each side's SYN needs its own ACK. That would be four segments, but the server's ACK and SYN ride one segment (SYN-ACK), giving three. The ack number is always the next sequence number expected (x+1, y+1), not the last one received. The connection isn't usable until the final ACK lands: a full RTT of setup latency before byte one.

TCP congestion (AIMD)

The window ramps and halves on loss — the sawtooth.

cwnd = 1 · red dot = packet loss → window halved

The shape is the point. TCP doesn't blast at line rate; it probes: exponential slow start, then linear additive increase, until a loss signals congestion and it multiplicatively decreases (halves). That sawtooth is AIMD. It also means a single low-latency request sharing a queue with a fat bulk flow waits behind the backlog that flow builds, which is why low-latency stacks pace, use small buffers, or bypass TCP entirely (UDP / RDMA) on the hot path.
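The sawtooth can be reproduced with a per-RTT update rule. This is a deliberately simplified model in whole segments: the names are illustrative, and real TCP stacks add fast retransmit, timeouts, and variants such as CUBIC that this sketch ignores.

```c
/* Simplified AIMD state, in segments. */
struct aimd {
    int cwnd;       /* congestion window */
    int ssthresh;   /* slow-start threshold */
};

/* Advance one RTT. `loss` is nonzero if a loss was detected this RTT. */
static void aimd_on_rtt(struct aimd *s, int loss)
{
    if (loss) {
        int half = s->cwnd / 2;
        s->ssthresh = half > 1 ? half : 1;
        s->cwnd = s->ssthresh;              /* multiplicative decrease */
    } else if (s->cwnd < s->ssthresh) {
        s->cwnd *= 2;                       /* slow start: double per RTT */
        if (s->cwnd > s->ssthresh)
            s->cwnd = s->ssthresh;
    } else {
        s->cwnd += 1;                       /* additive increase */
    }
}
```

Starting at cwnd = 1 with ssthresh = 8, the window goes 2, 4, 8, 9, 10; a loss at 10 halves it to 5, and the linear climb resumes from there.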

TCP incast

Drag the sender count past the buffer and watch drops spike.

switch egress port

0/12

→ uplink · drains 4/tick

drops: 0

delivered: 0/48

goodput: 0.00/tick · 0%

senders N: 8 · egress buffer: 12 pkts

tick 0 · N×1 = 8 pkts/tick offered vs 4 drained · buffer absorbing

TCP incast. N senders synchronise a burst at one egress port. While N ≤ 4 the port drains as fast as packets arrive and the buffer stays shallow. Push N past what the buffer can absorb in one RTT and the tail overflows: packets are dropped, each dropped flow stalls a full RTO before retrying, and aggregate goodput collapses even though the link is barely used. Drag N up to find the cliff; a bigger buffer pushes the cliff right but adds queueing latency.
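The buffer dynamics can be sketched as a per-tick queue update using the demo's parameters (12-packet buffer, 4 packets drained per tick, one packet per sender per tick). The names are illustrative, the arrive-then-drain ordering within a tick is an assumption, and RTO stalls and retransmissions are not modelled.

```c
struct incast {
    int queue;       /* packets currently buffered at the egress port */
    int cap;         /* buffer capacity in packets */
    int drain;       /* packets the uplink drains per tick */
    int drops;       /* packets dropped so far */
    int delivered;   /* packets sent on the uplink so far */
};

/* One tick: n senders each offer one packet, then the port drains. */
static void incast_tick(struct incast *s, int n)
{
    int room     = s->cap - s->queue;
    int accepted = n < room ? n : room;
    s->drops += n - accepted;           /* tail drop once the buffer is full */
    s->queue += accepted;

    int out = s->queue < s->drain ? s->queue : s->drain;
    s->queue     -= out;
    s->delivered += out;
}
```

With N = 8 the queue grows by 4 per tick, fills by the second tick, and from the third tick on 4 of every 8 offered packets are dropped. With N = 4 arrivals match the drain rate and nothing is lost.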

All-reduce: ring vs tree

Slide the GPU count; compare 2(N−1) vs 2·log₂N steps.

GPUs (N): 8

Ring

2(N−1) = 14

Tree

2⌈log₂N⌉ = 6

hop 0/14

Why the algorithm isn't “a detail.” Ring all-reduce is bandwidth-optimal (each GPU sends ~2× its data, total independent of N) but takes 2(N−1) latency hops; tree / halving-doubling takes only 2⌈log₂N⌉. At N = 1024 that's 2046 hops vs 20. On a GPU cluster the collective is on the critical path — one slow link or a bad algorithm choice stalls every GPU, which is exactly why AI fabrics obsess over tail latency and in-order delivery.
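The step counts quoted above follow directly from the two formulas. A small sketch with illustrative helper names, valid for N ≥ 1:

```c
/* Smallest k such that 2^k >= n (n >= 1). */
static unsigned ceil_log2(unsigned n)
{
    unsigned bits = 0;
    while ((1u << bits) < n)
        bits++;
    return bits;
}

/* Ring all-reduce: 2(N-1) sequential steps. */
static unsigned ring_steps(unsigned n) { return 2 * (n - 1); }

/* Tree / halving-doubling all-reduce: 2 * ceil(log2 N) steps. */
static unsigned tree_steps(unsigned n) { return 2 * ceil_log2(n); }
```

For N = 8 these give 14 and 6; for N = 1024 they give 2046 and 20, matching the figures in the text.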
