Reading a real driver: ixy
Everything on the datapath page and in the earn-the-vocabulary drills is abstract until you see it in real code. ixy is an educational userspace driver for Intel 82599 (ixgbe) NICs: a few hundred lines of readable C that does the whole RX/TX datapath. If you can talk through these excerpts, the "have you written a driver?" gap shrinks to "I've read and understood one." I cloned the repo and walked it end to end; below is the tour, illustrated.
0 · The whole repo at a glance
ixy exposes one generic struct ixy_device to your application: a little C vtable of function pointers (rx_batch, tx_batch, ...). The app calls the inline stub ixy_rx_batch(), which dispatches through the pointer to the real ixgbe_rx_batch() (or the virtio one). The driver recovers its private struct with container_of, the same trick the Linux kernel uses. At startup ixy_init() reads PCI config space, checks the device class is a NIC, and installs the ixgbe (or virtio) function pointers based on the vendor ID. The result is runtime polymorphism from a plain C struct.
// device.h: ONE generic device the app talks to; each driver fills in the pointers
struct ixy_device {
    const char* pci_addr;
    const char* driver_name;
    uint16_t num_rx_queues, num_tx_queues;
    uint32_t (*rx_batch)(struct ixy_device*, uint16_t q, struct pkt_buf* bufs[], uint32_t n);
    uint32_t (*tx_batch)(struct ixy_device*, uint16_t q, struct pkt_buf* bufs[], uint32_t n);
    // ... read_stats, set_promisc, get/set_mac_addr ...
};
// the app never calls ixgbe code directly; it calls this stub, which dispatches
static inline uint32_t ixy_rx_batch(struct ixy_device* dev, uint16_t q,
                                    struct pkt_buf* bufs[], uint32_t n) {
    return dev->rx_batch(dev, q, bufs, n); // -> ixgbe_rx_batch (or virtio_rx_batch)
}
That indirection is the whole reason a 60-line ixy-fwd app can drive two completely different NIC families without changing a line.
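To make the dispatch-plus-container_of pattern concrete, here is a minimal sketch of my own, not ixy's code. The names `struct device`, `fake_a_*`, `fake_b_*` and the simplified `container_of` macro are all hypothetical; the point is only that the stub never names a concrete driver, and each driver finds its private state from the generic pointer regardless of where the generic struct is embedded.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Simplified container_of: recover the enclosing struct from a pointer to a member.
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

// Generic device the application sees (hypothetical, far smaller than ixy's).
struct device {
    const char *driver_name;
    uint32_t (*rx_batch)(struct device *dev, uint32_t n);
};

// Two "drivers", each embedding the generic struct next to private state.
struct fake_a_device { struct device dev; uint32_t rx_calls; };
struct fake_b_device { uint32_t rx_calls; struct device dev; };

static uint32_t fake_a_rx(struct device *dev, uint32_t n) {
    struct fake_a_device *a = container_of(dev, struct fake_a_device, dev);
    a->rx_calls++;
    return n;      // pretend every requested packet arrived
}

static uint32_t fake_b_rx(struct device *dev, uint32_t n) {
    struct fake_b_device *b = container_of(dev, struct fake_b_device, dev);
    b->rx_calls++;
    return n / 2;  // pretend only half arrived
}

// The stub the application calls; it dispatches through the pointer.
static inline uint32_t device_rx_batch(struct device *dev, uint32_t n) {
    return dev->rx_batch(dev, n);
}
```

The application would hold only `struct device *` values, so swapping `fake_a` for `fake_b` changes nothing at the call site.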
1 · Mapping the NIC: PCIe & MMIO
Before you can touch a register you have to own the device. ixy unbinds the kernel driver, flips the bus-master bit in PCI config space (so the NIC is allowed to DMA), and mmaps the device's BAR0 (/sys/bus/pci/devices/<addr>/resource0) into its own address space. After that, a NIC register is just hw + reg.
// pci.c: turn a PCI device into a pointer you can write (error checks trimmed)
uint8_t* pci_map_resource(const char* pci_addr) {
    char path[PATH_MAX];
    snprintf(path, PATH_MAX, "/sys/bus/pci/devices/%s/resource0", pci_addr);
    remove_driver(pci_addr);   // unbind the kernel's ixgbe driver
    enable_dma(pci_addr);      // set the bus-master bit in PCI config space
    int fd = open(path, O_RDWR);
    struct stat st;
    fstat(fd, &st);            // the file size is the size of BAR0
    // BAR0 becomes a normal memory mapping in our address space:
    uint8_t* hw = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    return hw;                 // hw + reg == a NIC register
}
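The "flip the bus-master bit" step is a two-byte read-modify-write of the PCI command register. The sketch below is my own illustration under stated assumptions: `set_bus_master` is a hypothetical helper that takes an already-open file descriptor for the device's config space, and it assumes a little-endian host (as on x86) because config space is little-endian. The command register sits at offset 4 and bus master enable is bit 2.

```c
#define _XOPEN_SOURCE 700   // for pread/pwrite/fileno
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define PCI_COMMAND_OFFSET     4          // command register offset in config space
#define PCI_COMMAND_BUS_MASTER (1u << 2)  // bit 2: bus master enable

// Set the bus-master bit through an open config-space file descriptor.
// Returns 0 on success, -1 on a failed or short read/write.
static int set_bus_master(int config_fd) {
    uint16_t cmd = 0;
    if (pread(config_fd, &cmd, sizeof(cmd), PCI_COMMAND_OFFSET) != (ssize_t)sizeof(cmd))
        return -1;
    cmd |= PCI_COMMAND_BUS_MASTER;  // read-modify-write keeps the other bits intact
    if (pwrite(config_fd, &cmd, sizeof(cmd), PCI_COMMAND_OFFSET) != (ssize_t)sizeof(cmd))
        return -1;
    return 0;
}
```

On a real system the descriptor would come from opening the device's `config` file under sysfs with `O_RDWR`; without this bit the NIC may be configured correctly and still never write a packet into host memory.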
// device.h: an MMIO register write is just a volatile store to the mapped BAR
static inline void set_reg32(uint8_t* addr, int reg, uint32_t value) {
    __asm__ volatile ("" : : : "memory"); // compiler barrier, NOT a CPU barrier
    *((volatile uint32_t*) (addr + reg)) = value;
}
An MMIO register write is a volatile store to an mmapped BAR; the CPU's root complex routes it across PCIe to the device. ixy only needs a compiler barrier on x86, where the hardware keeps stores ordered, but on a weakly-ordered CPU you'd need a real barrier.
2 · DMA: the NIC needs physical addresses
The NIC is a bus master: it reads and writes host RAM by physical (or IOMMU) address and knows nothing of your process's virtual addresses. So a userspace driver has to do two unusual things: get the physical address of a buffer, and make sure that buffer never moves or swaps out.
// memory.c: translate a virtual address to a physical one via /proc/self/pagemap
static uintptr_t virt_to_phys(void *virt) {
    long pagesize = sysconf(_SC_PAGESIZE);
    int fd = open("/proc/self/pagemap", O_RDONLY);
    // pagemap holds one 64-bit entry per normal-sized page
    lseek(fd, (uintptr_t)virt / pagesize * sizeof(uintptr_t), SEEK_SET);
    uintptr_t phy = 0;
    read(fd, &phy, sizeof(phy));
    close(fd);
    // bits 0-54 are the page number
    return (phy & 0x7fffffffffffffULL) * pagesize + (uintptr_t)virt % pagesize;
}
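The bit arithmetic in that last line is easier to check in isolation. Below is a small sketch of my own (the function and macro names are hypothetical) that decodes one 64-bit pagemap entry. It relies on two facts from the kernel's pagemap documentation: bit 63 means "page present" and bits 0-54 hold the page frame number. It also treats a PFN of zero as a failure, since recent kernels zero that field for callers without CAP_SYS_ADMIN.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGEMAP_PRESENT  (1ULL << 63)        // page is present in RAM
#define PAGEMAP_PFN_MASK ((1ULL << 55) - 1)  // bits 0-54: page frame number

// Decode one pagemap entry into a physical address.
// Returns false when the page is not present or the PFN is hidden (reads as 0).
static bool pagemap_entry_to_phys(uint64_t entry, uint64_t virt, uint64_t pagesize,
                                  uint64_t *phys_out) {
    uint64_t pfn = entry & PAGEMAP_PFN_MASK;
    if (!(entry & PAGEMAP_PRESENT) || pfn == 0)
        return false;
    *phys_out = pfn * pagesize + virt % pagesize;  // frame base + offset within the page
    return true;
}
```

The zero-PFN check matters in practice: a driver that skips it can silently program address 0 into a descriptor when it runs without the needed privileges.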
The buffers come from huge pages (fewer TLB entries, and 2 MB of contiguous physical memory) that are mlocked so the kernel can never swap them while the NIC might DMA into them:
// memory.c: memory the NIC can DMA into (huge pages, pinned, never swapped)
struct dma_memory memory_allocate_dma(size_t size, bool require_contiguous) {
    // ... open a file on hugetlbfs (/mnt/huge), round size up to 2 MB ...
    void *virt = mmap(NULL, size, PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_HUGETLB, fd, 0);
    mlock(virt, size);                 // never swap out DMA memory
    return (struct dma_memory){
        .virt = virt,
        .phy = virt_to_phys(virt)      // <-- the address the NIC actually uses
    };
}
That region is then carved into a mempool of fixed-size pkt_bufs. Each buffer caches its own buf_addr_phy once, so the hot path never has to translate an address again; it just reads the field:
// memory.c: carve the DMA region into fixed-size packet buffers
struct mempool* memory_allocate_mempool(uint32_t num_entries, uint32_t entry_size) {
    entry_size = entry_size ? entry_size : 2048;   // must divide the hugepage
    struct dma_memory mem = memory_allocate_dma(num_entries * entry_size, false);
    // ... allocate mempool; mempool->base_addr = mem.virt; a free-stack of entry ids ...
    for (uint32_t i = 0; i < num_entries; i++) {
        // cast to a byte pointer first: arithmetic on void* is not standard C
        struct pkt_buf* buf = (struct pkt_buf*) ((uint8_t*) mempool->base_addr + i * entry_size);
        buf->buf_addr_phy = virt_to_phys(buf);     // each buf carries its OWN phys addr
        buf->mempool_idx = i;
        buf->mempool = mempool;
    }
    return mempool;
}
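The "free-stack of entry ids" elided in that excerpt is worth seeing on its own. Here is a minimal sketch, assuming a fixed pool of four entries; `struct pool`, `pool_alloc` and `pool_free` are hypothetical names, not ixy's API. Allocation pops an id, freeing pushes one, and both are a couple of instructions with no locking (one thread per pool is assumed).

```c
#include <assert.h>
#include <stdint.h>

#define POOL_ENTRIES 4u

struct pool {
    uint32_t free_stack[POOL_ENTRIES];  // ids of buffers that are currently free
    uint32_t free_top;                  // number of valid ids on the stack
};

static void pool_init(struct pool *p) {
    for (uint32_t i = 0; i < POOL_ENTRIES; i++) p->free_stack[i] = i;
    p->free_top = POOL_ENTRIES;
}

// Pop a free buffer id; returns -1 when the pool is exhausted.
static int64_t pool_alloc(struct pool *p) {
    if (p->free_top == 0) return -1;
    return p->free_stack[--p->free_top];
}

// Push a buffer id back; the caller must own it (no double-free detection here).
static void pool_free(struct pool *p, uint32_t id) {
    assert(p->free_top < POOL_ENTRIES);
    p->free_stack[p->free_top++] = id;
}
```

A stack (rather than a queue) hands back the most recently freed buffer first, which is the one most likely to still be in cache.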
ixy reads /proc/self/pagemap for the translation and uses hugepages so the memory is contiguous and never swapped. With an IOMMU/VFIO, that .phy becomes an IOVA the IOMMU translates and bounds-checks.
3 · Descriptors: a union the driver and NIC take turns writing
A descriptor is a tiny struct describing one buffer. The clever part: it's a union of two layouts over the same bytes, the read format the driver writes ("DMA the packet to this address") and the write-back format the NIC writes ("done; here's the length"). They never write it at the same time because ownership passes back and forth.
// One 16-byte descriptor, two views of the SAME bytes (a union):
union ixgbe_adv_rx_desc {
    struct {                      // READ format: WE write this
        __le64 pkt_addr;          // where the NIC should DMA the packet
        __le64 hdr_addr;          // (also doubles as the "hand it back" reset)
    } read;
    struct {                      // WRITE-BACK format: the NIC writes this
        /* ... lower half: RSS hash, packet type ... */
        struct {
            __le32 status_error;  // includes the DD ("descriptor done") bit
            __le16 length;        // bytes the NIC received
            /* ... */
        } upper;                  // overlays the bytes of read.hdr_addr
    } wb;
};
The handshake flag is the DD bit ("descriptor done") in status_error: the NIC sets it when it has filled the buffer; the driver hands ownership back by rewriting the descriptor in read format, which overwrites the status bytes and so clears DD.
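The overlay is easy to verify with a toy version. This is a sketch under my own simplifications: `union rx_desc` flattens the real nested layout, `RX_STAT_DD` is a hypothetical name for the done flag, and `fake_nic_write_back` stands in for the hardware. It shows that zeroing `hdr_addr` wipes the status and length fields, which is why that store doubles as "give the slot back".

```c
#include <assert.h>
#include <stdint.h>

#define RX_STAT_DD 0x01u  // "descriptor done" flag (hypothetical name for this sketch)

union rx_desc {
    struct {                    // read format: written by the driver
        uint64_t pkt_addr;
        uint64_t hdr_addr;
    } read;
    struct {                    // write-back format: written by the NIC
        uint64_t lower;         // stands in for the RSS / packet-type fields
        uint32_t status_error;
        uint16_t length;
        uint16_t vlan;
    } wb;
};

_Static_assert(sizeof(union rx_desc) == 16, "both views must cover the same 16 bytes");

// Pretend to be the NIC: report a received packet of `len` bytes.
static void fake_nic_write_back(union rx_desc *d, uint16_t len) {
    d->wb.length = len;
    d->wb.status_error = RX_STAT_DD;
}

// Driver side: re-arm the slot with a new buffer. Zeroing hdr_addr overwrites the
// bytes that hold status_error and length, so DD is cleared as a side effect.
static void driver_rearm(union rx_desc *d, uint64_t new_buf_phys) {
    d->read.pkt_addr = new_buf_phys;
    d->read.hdr_addr = 0;
}
```

Because the reset value is zero, the trick does not depend on byte order; only the field offsets have to line up.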
4 · The ring: "we own the tail, the hardware owns the head"
That one comment from ixy's RX function is the entire mental model. The descriptor ring is a circular array; the driver advances the tail register (RDT for RX, TDT for TX) to publish work, and the NIC advances the head (RDH/TDH) as it consumes. Writing the tail register is the doorbell. It's the same producer/consumer split as the SPSC ring: one side owns each index.
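The index arithmetic behind a circular ring fits in two helpers. ixy has its own wrap_ring helper for this; the functions below are my own sketch with hypothetical names, and they assume the ring size is a power of two so that a mask can replace the modulo.

```c
#include <assert.h>
#include <stdint.h>

// Advance an index by one slot, wrapping at ring_size (must be a power of two).
static inline uint16_t ring_next(uint16_t index, uint16_t ring_size) {
    return (uint16_t)((index + 1) & (ring_size - 1));
}

// Slots published by the producer (tail) that the consumer (head) has not reached yet.
// The subtraction may go negative; masking folds it back into the ring.
static inline uint16_t ring_pending(uint16_t head, uint16_t tail, uint16_t ring_size) {
    return (uint16_t)((tail - head) & (ring_size - 1));
}
```

For example, with 512 entries, a head at 500 and a tail at 4 means 16 slots are pending, even though the tail is numerically behind the head.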
4.5 · Bringing the queues up
Before any of that runs, the device has to be reset and configured. reset_and_init() follows a strict order: disable interrupts → global reset → wait → init link → set up the RX and TX rings → enable the queues → wait for link. You always reset and configure before you enable, and you wait for the link last.
Setting up a queue means handing the NIC the ring's physical base address and length (RDBAL/RDBAH/RDLEN for RX, TDBAL/TDBAH/TDLEN for TX), choosing buffer size and DROP_EN on RX, and the prefetch/write-back thresholds on TX, then flipping the enable bit and polling until the hardware acknowledges.
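Two small pieces of that setup can be sketched without hardware: splitting the 64-bit ring address across a low/high register pair, and polling for an acknowledgement with a bound instead of forever. All names here are hypothetical, and the 16-byte descriptor size is the one used by the descriptors shown earlier.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// Split a 64-bit ring base address into the halves a low/high register pair expects.
static inline uint32_t addr_low(uint64_t phys)  { return (uint32_t)(phys & 0xFFFFFFFFu); }
static inline uint32_t addr_high(uint64_t phys) { return (uint32_t)(phys >> 32); }

// Ring length in bytes for the length register: entries * 16-byte descriptors.
static inline uint32_t ring_bytes(uint32_t num_entries) { return num_entries * 16u; }

// Poll a register until all bits in `mask` are set, giving up after max_polls reads.
static bool wait_bits_set(const volatile uint32_t *reg, uint32_t mask, uint32_t max_polls) {
    for (uint32_t i = 0; i < max_polls; i++) {
        if ((*reg & mask) == mask) return true;
    }
    return false;
}
```

A real driver would usually sleep briefly between polls; the bound is there so that a device which never acknowledges shows up as an error rather than a hang.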
5 · The RX path, for real
This is what "trace one RX path end to end" actually looks like. Read each slot; if the NIC set DD, take the packet, refill the slot with a fresh buffer, hand the slot back, and move on; when you hit a slot without DD, stop. Then ring the doorbell once.
// tl;dr: we control the TAIL of the queue, the hardware the HEAD
uint32_t ixgbe_rx_batch(/* ... */ struct pkt_buf *bufs[], uint32_t num_bufs) {
    /* ... dev and queue are looked up from the device and queue id ... */
    uint16_t rx_index = queue->rx_index;
    uint16_t last_rx_index = rx_index;   // last slot we handed back to the NIC
    uint32_t buf_index;
    for (buf_index = 0; buf_index < num_bufs; buf_index++) {
        volatile union ixgbe_adv_rx_desc *desc = queue->descriptors + rx_index;
        uint32_t status = desc->wb.upper.status_error;
        if (status & IXGBE_RXDADV_STAT_DD) {   // DD = NIC has filled this slot
            struct pkt_buf *buf = queue->virtual_addresses[rx_index];
            buf->size = desc->wb.upper.length;
            struct pkt_buf *new_buf = pkt_buf_alloc(queue->mempool);  // refill
            desc->read.pkt_addr = new_buf->buf_addr_phy + offsetof(struct pkt_buf, data);
            desc->read.hdr_addr = 0;           // hand the slot back to the NIC
            queue->virtual_addresses[rx_index] = new_buf;
            bufs[buf_index] = buf;             // give the packet to the caller
            last_rx_index = rx_index;          // remember it for the doorbell
            rx_index = wrap_ring(rx_index, queue->num_entries);
        } else {
            break;                             // DD not set -> no more packets
        }
    }
    if (rx_index != last_rx_index) {           // only if we took at least one packet
        // ring the doorbell: publish how far we've refilled (the TAIL)
        set_reg32(dev->addr, IXGBE_RDT(queue_id), last_rx_index);
        queue->rx_index = rx_index;
    }
    return buf_index;                          // number of packets stored in bufs
}
- Polling, by default. The hot path just reads the DD bit in a loop, with no interrupt. (ixy can optionally wait on a VFIO interrupt first, but the datapath is a poll.) That's the kernel-bypass model: trade a burned core for latency, exactly like busy-poll vs. interrupt.
- Refill is mandatory. Every packet you take out, you put a fresh buffer back in (desc->read.pkt_addr = new_buf->buf_addr_phy), or the ring runs dry and the NIC drops.
- The doorbell is one MMIO write (set_reg32(... RDT ...)), done after the batch and intentionally one slot behind, so the driver never signals RDT == RDH (which would mean "full").
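The three points above can be exercised without a NIC. The following is a toy model of my own, not ixy's code: `struct rx_ring` keeps plain integers where the real driver has descriptors and packet buffers, `tail_reg` stands in for RDT, `next_fresh_buf` stands in for the mempool, and `fake_nic_fill` plays the hardware.

```c
#include <assert.h>
#include <stdint.h>

#define RING_SIZE 8u   // power of two
#define STAT_DD   0x1u

struct slot { uint32_t status; uint16_t length; int buf_id; };

struct rx_ring {
    struct slot desc[RING_SIZE];
    uint16_t rx_index;   // next slot the driver will check
    uint16_t tail_reg;   // stands in for the RDT register
    int next_fresh_buf;  // stands in for the mempool
};

// Fake NIC: mark a slot as filled with a packet of `len` bytes.
static void fake_nic_fill(struct rx_ring *r, uint16_t slot_idx, uint16_t len) {
    r->desc[slot_idx].length = len;
    r->desc[slot_idx].status = STAT_DD;
}

// Poll up to n slots; returns packets taken and stores their buffer ids in out[].
static uint32_t rx_poll(struct rx_ring *r, int out[], uint32_t n) {
    uint16_t idx = r->rx_index, last = idx;
    uint32_t got = 0;
    while (got < n && (r->desc[idx].status & STAT_DD)) {
        out[got++] = r->desc[idx].buf_id;           // hand the packet to the caller
        r->desc[idx].buf_id = r->next_fresh_buf++;  // refill with a fresh buffer
        r->desc[idx].status = 0;                    // give the slot back (clears DD)
        last = idx;
        idx = (uint16_t)((idx + 1) & (RING_SIZE - 1));
    }
    if (got > 0) {
        r->tail_reg = last;  // one doorbell per batch, one slot behind rx_index
        r->rx_index = idx;
    }
    return got;
}
```

After three packets arrive in slots 0 to 2, a single poll returns all three, writes 2 to the tail once, and leaves `rx_index` at 3; an empty poll touches nothing.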
6 · The TX path: clean, then send, then one doorbell
TX is two halves: reclaim descriptors the NIC has finished (DD set) so you can free those buffers, then post new ones. The RS flag ("report status") is how you ask the NIC to set DD when it's done. The ring is "full" when posting one more would catch the cleanup index, so ixy always leaves one slot empty: the classic trick to tell full from empty on a head/tail ring.
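uint32_t ixgbe_tx_batch(/* ... */ struct pkt_buf *bufs[], uint32_t num_bufs) {
    /* ... dev and queue are looked up from the device and queue id ... */
    uint16_t clean_index = queue->clean_index;   // next descriptor to reclaim
    // step 1: reclaim descriptors the NIC has FINISHED (DD set), in batches
    while (true) {
        /* ... cleanable = descriptors between clean_index and tx_index ... */
        if (cleanable < TX_CLEAN_BATCH) break;
        /* ... cleanup_to = index of the last descriptor in this batch ... */
        if (queue->descriptors[cleanup_to].wb.status & IXGBE_ADVTXD_STAT_DD) {
            /* ... free those bufs back to the mempool, advance clean_index ... */
        } else {
            break;                               // not done yet, stop
        }
    }
    queue->clean_index = clean_index;
    // step 2: post new packets
    uint32_t sent;
    for (sent = 0; sent < num_bufs; sent++) {
        uint32_t next_index = wrap_ring(queue->tx_index, queue->num_entries);
        if (clean_index == next_index) break;    // ring full
        struct pkt_buf *buf = bufs[sent];
        /* ... txd = descriptor at queue->tx_index; remember buf for cleanup ... */
        txd->read.buffer_addr = buf->buf_addr_phy + offsetof(struct pkt_buf, data);
        txd->read.cmd_type_len = IXGBE_ADVTXD_DCMD_EOP   // last descriptor of the packet
                               | IXGBE_ADVTXD_DCMD_RS    // ask the NIC to report status (DD)
                               | /* ... */ buf->size;
        queue->tx_index = next_index;
    }
    // ONE doorbell write for the whole batch, never per packet
    set_reg32(dev->addr, IXGBE_TDT(queue_id), queue->tx_index);
    return sent;
}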
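The "leave one slot empty" rule is the part people get wrong on a whiteboard, so here is a stripped-down sketch of just the index logic. It is my own illustration with hypothetical names; there are no descriptors or buffers, and `tx_reclaim` simply assumes the fake NIC has finished the given number of descriptors.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TX_RING_SIZE 8u  // power of two

struct tx_ring {
    uint16_t clean_index;  // oldest descriptor not yet reclaimed
    uint16_t tx_index;     // next descriptor to fill
};

static inline uint16_t tx_next(uint16_t i) {
    return (uint16_t)((i + 1) & (TX_RING_SIZE - 1));
}

// Empty: nothing outstanding. Full: filling one more slot would make
// tx_index == clean_index, which would look exactly like empty.
static inline bool tx_empty(const struct tx_ring *r) { return r->tx_index == r->clean_index; }
static inline bool tx_full(const struct tx_ring *r)  { return tx_next(r->tx_index) == r->clean_index; }

// Post up to n packets; returns how many fit.
static uint32_t tx_post(struct tx_ring *r, uint32_t n) {
    uint32_t sent = 0;
    while (sent < n && !tx_full(r)) {
        r->tx_index = tx_next(r->tx_index);
        sent++;
    }
    return sent;
}

// Reclaim up to `done` descriptors that the (fake) NIC has finished.
static void tx_reclaim(struct tx_ring *r, uint32_t done) {
    for (uint32_t i = 0; i < done && !tx_empty(r); i++)
        r->clean_index = tx_next(r->clean_index);
}
```

With 8 slots the ring accepts at most 7 outstanding packets; the cost of one wasted descriptor buys an unambiguous full/empty test without a separate counter.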
7 · The application: ixy-fwd's poll loop
All of the above exists to serve a tiny loop. forward() rx-batches up to 32 packets, touches each one (data[1]++, so the benchmark isn't cheating by leaving packets in cache), tx-batches them to the other port, and drops whatever the TX ring couldn't take, because waiting would just accumulate latency. main() spins that in both directions forever in a single thread (you'd pin it to an isolated core in production); the only off-datapath work is a throttled once-a-second stats print.
// app/ixy-fwd.c: the entire forwarding workload is a poll loop
static void forward(struct ixy_device* rx_dev, uint16_t rx_q,
                    struct ixy_device* tx_dev, uint16_t tx_q) {
    struct pkt_buf* bufs[BATCH_SIZE];   // BATCH_SIZE = 32
    uint32_t num_rx = ixy_rx_batch(rx_dev, rx_q, bufs, BATCH_SIZE);
    if (num_rx > 0) {
        for (uint32_t i = 0; i < num_rx; i++) bufs[i]->data[1]++;          // touch each packet
        uint32_t num_tx = ixy_tx_batch(tx_dev, tx_q, bufs, num_rx);
        for (uint32_t i = num_tx; i < num_rx; i++) pkt_buf_free(bufs[i]);  // drop unsent
    }
}
int main(/* ... */) {
    struct ixy_device* dev1 = ixy_init(argv[1], 1, 1, -1);
    struct ixy_device* dev2 = ixy_init(argv[2], 1, 1, 0);
    while (true) {                      // one core, spinning forever, no syscalls
        forward(dev1, 0, dev2, 0);
        forward(dev2, 0, dev1, 0);
    }
}
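The drop-instead-of-wait policy can be checked with test doubles. This sketch is mine and makes several simplifying assumptions: packets are plain integers, `rx_fn`, `tx_fn` and `free_fn` are hypothetical callback types standing in for the batch calls and the buffer free, and the fakes always receive 5 packets while accepting at most 3 for transmit.

```c
#include <assert.h>
#include <stdint.h>

#define BATCH 32u

// Hypothetical callbacks standing in for the rx/tx batch calls and the buffer free.
typedef uint32_t (*rx_fn)(int bufs[], uint32_t n);
typedef uint32_t (*tx_fn)(int bufs[], uint32_t n);
typedef void (*free_fn)(int buf);

struct fwd_stats { uint64_t received, sent, dropped; };

// One forwarding pass: receive a batch, try to send all of it, free what did not fit.
static void forward_once(rx_fn rx, tx_fn tx, free_fn free_buf, struct fwd_stats *st) {
    int bufs[BATCH];
    uint32_t num_rx = rx(bufs, BATCH);
    if (num_rx == 0) return;
    uint32_t num_tx = tx(bufs, num_rx);
    for (uint32_t i = num_tx; i < num_rx; i++) free_buf(bufs[i]);  // never retry, never block
    st->received += num_rx;
    st->sent += num_tx;
    st->dropped += num_rx - num_tx;
}

// Test doubles: RX always yields 5 packets, TX accepts at most 3 per call.
static uint32_t fake_rx(int bufs[], uint32_t n) {
    uint32_t k = n < 5 ? n : 5;
    for (uint32_t i = 0; i < k; i++) bufs[i] = (int)i;
    return k;
}
static uint32_t fake_tx(int bufs[], uint32_t n) { (void)bufs; return n < 3 ? n : 3; }
static uint32_t freed_count;
static void fake_free(int buf) { (void)buf; freed_count++; }
```

Freeing the unsent tail is what keeps the mempool from leaking: every buffer taken from RX is either owned by the TX ring or returned to the pool before the next pass.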
8 · The memory-ordering gem
This is the single best thing to bring up: ixy's actual comment on the line that rings the TX doorbell.
// send out by advancing tail, i.e., pass control of the bufs to the nic
// this seems like a textbook case for a release memory order,
// but Intel's driver doesn't even use a compiler barrier here
set_reg32(dev->addr, IXGBE_TDT(queue_id), queue->tx_index);
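To show where a barrier would sit on other architectures, here is a hedged sketch of my own; ixy itself does not contain it. `doorbell_wmb` is a hypothetical macro: on x86 it is only a compiler barrier, on arm64 it uses `dmb oshst` (to my knowledge the instruction Linux maps dma_wmb() to there), and elsewhere it falls back to a GCC/Clang full fence, which is an assumption you would need to check against your platform's I/O barrier rules.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical helper: order earlier descriptor stores before the doorbell store.
#if defined(__x86_64__) || defined(__i386__)
// x86: stores are not reordered with other stores, so stop only the compiler.
#  define doorbell_wmb() __asm__ volatile ("" ::: "memory")
#elif defined(__aarch64__)
// arm64: store barrier over the outer-shareable domain.
#  define doorbell_wmb() __asm__ volatile ("dmb oshst" ::: "memory")
#else
// Conservative fallback (GCC/Clang builtin); an assumption, not verified per platform.
#  define doorbell_wmb() __atomic_thread_fence(__ATOMIC_SEQ_CST)
#endif

struct fake_desc { uint64_t buffer_addr; uint32_t cmd_type_len; };

// Fill a descriptor, then publish it by writing the tail "register".
static void post_and_ring(struct fake_desc *ring, uint16_t slot,
                          uint64_t buf_phys, uint32_t len,
                          volatile uint32_t *tail_reg, uint32_t new_tail) {
    ring[slot].buffer_addr = buf_phys;
    ring[slot].cmd_type_len = len;
    doorbell_wmb();        // descriptor contents must be visible before the doorbell
    *tail_reg = new_tail;  // the doorbell itself
}
```

The barrier goes between the last descriptor store and the tail write, once per batch, so its cost is amortized over every packet in the batch.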
Logically the doorbell is a release publish. On x86 (TSO) stores are already ordered with respect to each other, so no explicit barrier is needed and Intel omits it. On a weakly-ordered CPU (Arm, POWER, RISC-V) this would be a real bug: you'd need a dma_wmb() before the doorbell. Naming this distinction unprompted is exactly the senior signal from the doorbell drill.
How to use this in the interview
Source: github.com/emmericp/ixy and the paper "User Space Network Drivers" (Emmerich et al.). Excerpts are lightly trimmed for readability.