
Reading a real driver: ixy

Everything on the datapath page and in the earn-the-vocabulary drills is abstract until you see it in real code. ixy is an educational userspace driver for Intel 82599 (ixgbe) NICs: roughly 1,000 lines of readable C that implement the whole RX/TX datapath. If you can talk through these excerpts, the “have you written a driver?” gap shrinks to “I've read and understood one.” I cloned the repo and walked it end to end; below is the tour, illustrated.

0 · The whole repo at a glance

ixy exposes one generic struct ixy_device to your application: a little C vtable of function pointers (rx_batch, tx_batch, …). The app calls the inline stub ixy_rx_batch(), which dispatches through the pointer to the real ixgbe_rx_batch() (or the virtio one). The driver recovers its private struct with container_of, the same trick the Linux kernel uses. At startup ixy_init() reads PCI config space, checks that the device class is a NIC, and installs the ixgbe (or virtio) function pointers based on the vendor ID: runtime polymorphism from a plain C struct.

Layered architecture: app to ixy_device vtable to ixgbe driver to memory/pci/vfio to the NIC over PCIe
The layers: your app → the device vtable → the ixgbe driver → memory.c / pci.c / vfio → the NIC over PCIe.
File: device.h (the vtable)
// device.h: ONE generic device the app talks to; each driver fills in the pointers
struct ixy_device {
    const char* pci_addr;
    const char* driver_name;
    uint16_t num_rx_queues, num_tx_queues;
    uint32_t (*rx_batch)(struct ixy_device*, uint16_t q, struct pkt_buf* bufs[], uint32_t n);
    uint32_t (*tx_batch)(struct ixy_device*, uint16_t q, struct pkt_buf* bufs[], uint32_t n);
    // ... read_stats, set_promisc, get/set_mac_addr ...
};

// the app never calls ixgbe code directly: it calls this stub, which dispatches
static inline uint32_t ixy_rx_batch(struct ixy_device* dev, uint16_t q,
                                    struct pkt_buf* bufs[], uint32_t n) {
    return dev->rx_batch(dev, q, bufs, n);   // -> ixgbe_rx_batch (or virtio_rx_batch)
}

That indirection is the whole reason a 60-line ixy-fwd app can drive two completely different NIC families without changing a line.
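To make the dispatch and the container_of recovery concrete, here is a minimal, self-contained sketch of the same pattern. It is not ixy's code: struct device, struct fake_nic, fake_rx_batch and fake_nic_init are hypothetical names invented for illustration, and the "driver" only counts down an imaginary packet counter.

```c
#include <stddef.h>
#include <stdint.h>

/* Generic device: the only type the application sees (hypothetical stand-in). */
struct device {
    const char *driver_name;
    uint32_t (*rx_batch)(struct device *dev, uint32_t n);
};

/* Recover the enclosing struct from a pointer to one of its members. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical driver-private state that embeds the generic device. */
struct fake_nic {
    uint32_t packets_ready;   /* private field the app never sees */
    struct device dev;        /* generic part, handed to the app */
};

/* The "real" driver function: gets the generic pointer, recovers its own struct. */
static uint32_t fake_rx_batch(struct device *dev, uint32_t n) {
    struct fake_nic *nic = container_of(dev, struct fake_nic, dev);
    uint32_t got = nic->packets_ready < n ? nic->packets_ready : n;
    nic->packets_ready -= got;
    return got;
}

/* The stub the application calls; it only dispatches through the pointer. */
static inline uint32_t dev_rx_batch(struct device *dev, uint32_t n) {
    return dev->rx_batch(dev, n);
}

/* Stand-in for the init step that installs the driver's function pointers. */
static void fake_nic_init(struct fake_nic *nic, uint32_t ready) {
    nic->packets_ready = ready;
    nic->dev.driver_name = "fake";
    nic->dev.rx_batch = fake_rx_batch;
}
```

An application would call fake_nic_init(&nic, 5) once and from then on only ever touch &nic.dev through dev_rx_batch(); swapping in a different driver means installing different pointers, nothing else.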

1 · Mapping the NIC: PCIe & MMIO

Before you can touch a register you have to own the device. ixy unbinds the kernel driver, flips the bus-master bit in PCI config space (so the NIC is allowed to DMA), and mmaps the device's BAR0 (/sys/bus/pci/devices/<addr>/resource0) into its own address space. After that, a NIC register is just hw + reg.

BAR0 mmap'd into the process address space; set_reg32 writes cross PCIe to NIC registers
mmap turns the NIC's register block into ordinary memory; a volatile store to it becomes a PCIe transaction.
File: pci.c (pci_map_resource)
// pci.c: turn a PCI device into a pointer you can write
uint8_t* pci_map_resource(const char* pci_addr) {
    remove_driver(pci_addr);                 // unbind the kernel's ixgbe driver
    enable_dma(pci_addr);                    // set the bus-master bit in PCI config space
    char path[PATH_MAX];                     // sysfs path of this device's BAR0
    snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/resource0", pci_addr);
    int fd = open(path, O_RDWR);
    struct stat st; fstat(fd, &st);
    // BAR0 is now a normal memory mapping in our address space:
    uint8_t* hw = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    return hw;                               // hw + reg  ==  a NIC register
}
File: device.h (set_reg32)
// device.h: an MMIO register write is just a volatile store to the mapped BAR
static inline void set_reg32(uint8_t* addr, int reg, uint32_t value) {
    __asm__ volatile ("" : : : "memory");    // compiler barrier, NOT a CPU barrier
    *((volatile uint32_t*) (addr + reg)) = value;
}
The interview line: “MMIO means a register write is literally a volatile store to an mmapped BAR; the CPU's PCIe root complex routes it to the device. ixy only needs a compiler barrier on x86, where the hardware keeps stores ordered, but on a weakly-ordered CPU you'd need a real barrier.”
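A small sketch of how register helpers in this style compose, exercised against an ordinary array standing in for BAR0. The read-modify-write helpers mirror the shape of set_reg32 above; the fake BAR and the offsets used with it are assumptions for illustration, not real ixgbe registers.

```c
#include <stdint.h>

/* Write one 32-bit register: compiler barrier, then a volatile store. */
static inline void set_reg32(uint8_t *addr, int reg, uint32_t value) {
    __asm__ volatile ("" : : : "memory");
    *((volatile uint32_t *)(addr + reg)) = value;
}

/* Read one 32-bit register: compiler barrier, then a volatile load. */
static inline uint32_t get_reg32(uint8_t *addr, int reg) {
    __asm__ volatile ("" : : : "memory");
    return *((volatile uint32_t *)(addr + reg));
}

/* Read-modify-write helpers: set or clear individual bits in a register. */
static inline void set_flags32(uint8_t *addr, int reg, uint32_t flags) {
    set_reg32(addr, reg, get_reg32(addr, reg) | flags);
}

static inline void clear_flags32(uint8_t *addr, int reg, uint32_t flags) {
    set_reg32(addr, reg, get_reg32(addr, reg) & ~flags);
}
```

With a real device, addr would be the pointer returned by the BAR0 mmap; here any 4-byte-aligned buffer works, which makes the helpers easy to unit-test without hardware.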

2 · DMA: the NIC needs physical addresses

The NIC is a bus master: it reads and writes host RAM by physical (or IOMMU) address and knows nothing of your process's virtual addresses. So a userspace driver has to do two unusual things: get the physical address of a buffer, and make sure that buffer never moves or swaps out.

A 2MB hugepage carved into fixed-size pkt_bufs, each carrying its own physical address from a mempool free-list
One pinned hugepage, carved into fixed-size pkt_bufs; each buffer caches its own physical address.
File: memory.c (virt_to_phys)
// memory.c: translate a virtual address to a physical one via /proc/self/pagemap
static uintptr_t virt_to_phys(void *virt) {
    long pagesize = sysconf(_SC_PAGESIZE);
    int fd = open("/proc/self/pagemap", O_RDONLY);
    lseek(fd, (uintptr_t)virt / pagesize * sizeof(uintptr_t), SEEK_SET);
    uintptr_t phy = 0;
    read(fd, &phy, sizeof(phy));
    close(fd);
    // bits 0-54 are the page number
    return (phy & 0x7fffffffffffffULL) * pagesize + (uintptr_t)virt % pagesize;
}
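The bit arithmetic in virt_to_phys can be isolated and tested without touching /proc. The sketch below decodes one 64-bit pagemap entry under the documented Linux layout (bits 0-54 hold the page frame number, bit 63 means "page present"). The function name pagemap_entry_to_phys and the macro names are hypothetical. It also treats a zero frame number as a failure, because kernels since 4.2 report the frame number as zero to processes without CAP_SYS_ADMIN.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGEMAP_PRESENT  (1ULL << 63)          /* page is in RAM */
#define PAGEMAP_PFN_MASK ((1ULL << 55) - 1)    /* bits 0-54: page frame number */

/*
 * Decode one pagemap entry for the page containing `virt`.
 * Returns false if the page is not present or the kernel hid the frame number.
 */
static bool pagemap_entry_to_phys(uint64_t entry, uint64_t virt,
                                  uint64_t pagesize, uint64_t *phys) {
    if (!(entry & PAGEMAP_PRESENT)) {
        return false;
    }
    uint64_t pfn = entry & PAGEMAP_PFN_MASK;
    if (pfn == 0) {
        return false;   /* likely missing privileges: PFN reported as zero */
    }
    /* frame base plus the offset of `virt` inside its page */
    *phys = pfn * pagesize + (virt % pagesize);
    return true;
}
```

The excerpt above skips both checks for brevity; in a driver, a silently wrong physical address means the NIC DMAs into someone else's memory, so failing loudly is the safer design.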

The buffers come from huge pages (fewer TLB entries, and 2 MB of contiguous physical memory) that are mlocked so the kernel can never swap them while the NIC might DMA into them:

File: memory.c (memory_allocate_dma)
// memory.c: memory the NIC can DMA into: huge pages, pinned, never swapped
struct dma_memory memory_allocate_dma(size_t size, bool require_contiguous) {
    // ... open a file on hugetlbfs (/mnt/huge), round size up to 2 MB ...
    void *virt = mmap(NULL, size, PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_HUGETLB, fd, 0);
    mlock(virt, size);                    // never swap out DMA memory
    return (struct dma_memory){
        .virt = virt,
        .phy  = virt_to_phys(virt)        // <-- the address the NIC actually uses
    };
}

That region is then carved into a mempool of fixed-size pkt_bufs. Each buffer caches its own buf_addr_phy once, so the hot path never has to translate an address again; it just reads the field:

File: memory.c (memory_allocate_mempool)
// memory.c: carve the DMA region into fixed-size packet buffers
struct mempool* memory_allocate_mempool(uint32_t num_entries, uint32_t entry_size) {
    entry_size = entry_size ? entry_size : 2048;          // must divide the hugepage
    struct dma_memory mem = memory_allocate_dma(num_entries * entry_size, false);
    // ... mempool->base_addr = mem.virt; a free-stack of entry ids ...
    for (uint32_t i = 0; i < num_entries; i++) {
        struct pkt_buf* buf = (struct pkt_buf*) ((uint8_t*) mempool->base_addr + i * entry_size);
        buf->buf_addr_phy = virt_to_phys(buf);   // each buf carries its OWN phys addr
        buf->mempool_idx  = i;
        buf->mempool      = mempool;
    }
    return mempool;
}
The mempool free-stack: alloc pops an id, free pushes it back. O(1), single-threaded
Allocation is a LIFO free-stack: pop an id on alloc, push it on free. It is O(1) and needs no locks because a mempool is only ever used from one thread.
The interview line: “A NIC DMAs to physical addresses, so the driver pins the pages and hands the device a physical (or IOVA) address; virtual pointers are meaningless to the hardware. ixy reads /proc/self/pagemap for the translation and uses hugepages so the memory is contiguous and never swapped.” With an IOMMU/VFIO, that .phy becomes an IOVA the IOMMU translates and bounds-checks.
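The free-stack itself is only a few lines. This is a minimal sketch of the idea, not ixy's mempool: struct free_stack, POOL_ENTRIES and the pool_* functions are hypothetical names, the pool holds buffer ids rather than real buffers, and it is deliberately not thread-safe.

```c
#include <stdint.h>

#define POOL_ENTRIES 4u   /* tiny on purpose; a real pool holds thousands */

struct free_stack {
    uint32_t free_top;                /* number of ids currently free */
    uint32_t entries[POOL_ENTRIES];   /* stack of free buffer ids */
};

/* Initially every buffer id is free. */
static void pool_init(struct free_stack *p) {
    p->free_top = POOL_ENTRIES;
    for (uint32_t i = 0; i < POOL_ENTRIES; i++) {
        p->entries[i] = i;
    }
}

/* Pop a free id, or return -1 when the pool is exhausted. O(1). */
static int64_t pool_alloc(struct free_stack *p) {
    if (p->free_top == 0) {
        return -1;
    }
    return p->entries[--p->free_top];
}

/* Push an id back. O(1). The caller must not free the same id twice. */
static void pool_free(struct free_stack *p, uint32_t id) {
    p->entries[p->free_top++] = id;
}
```

LIFO order is a deliberate choice: the buffer freed most recently is the one most likely to still be in cache, so it is the best one to hand out next.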

3 · Descriptors: a union the driver and NIC take turns writing

A descriptor is a tiny struct describing one buffer. The clever part is that it's a union of two layouts over the same bytes: the read format the driver writes (“DMA the packet to this address”) and the write-back format the NIC writes (“done; here's the length”). They never write it at the same time because ownership passes back and forth.

File: ixgbe_type.h (simplified)
// One 16-byte descriptor, two views of the SAME bytes (a union):
union ixgbe_adv_rx_desc {
    struct {                 // READ format: WE write this
        __le64 pkt_addr;     //   where the NIC should DMA the packet
        __le64 hdr_addr;     //   overlaps the write-back status: writing 0 clears DD
    } read;
    struct {                 // WRITE-BACK format: the NIC writes this
        /* ... */            //   (the real header nests the fields below in wb.upper)
        __le32 status_error; //   includes the DD ("descriptor done") bit
        __le16 length;       //   bytes the NIC received
    } wb;
};

The handshake flag is the DD bit (“descriptor done”) in status_error: the NIC sets it when it has filled the buffer; the driver rewrites the descriptor in the read format, which clears DD because both formats share the same bytes, and thereby prepares the slot to be handed back.

RX descriptor union: read format (pkt_addr, hdr_addr) the driver writes vs write-back format (status_error/DD, length) the NIC writes
The RX descriptor union: the driver writes the read format; the NIC overwrites it with the write-back (DD, length).
TX descriptor cmd_type_len bit field: EOP, RS, IFCS, DEXT, DTYP and the data length
The TX descriptor's cmd_type_len bit field: EOP marks the last buffer, RS asks the NIC to set DD on completion.
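The "two views of the same bytes" idea for the RX descriptor is easy to check in isolation. The sketch below uses a stripped-down union with plain fixed-width integers instead of the __le types; union rx_desc, RX_STAT_DD, rx_desc_post and rx_desc_done are hypothetical names, and the DD bit position (bit 0) is an assumption for the example rather than something to rely on without the datasheet.

```c
#include <stdint.h>

#define RX_STAT_DD 0x01u   /* assumed position of "descriptor done" for this sketch */

/* One 16-byte descriptor with two overlapping layouts. */
union rx_desc {
    struct {                     /* read format: written by the driver */
        uint64_t pkt_addr;
        uint64_t hdr_addr;
    } read;
    struct {                     /* write-back format: written by the NIC */
        uint64_t lower;          /* RSS hash etc., unused here */
        struct {
            uint32_t status_error;
            uint16_t length;
            uint16_t vlan;
        } upper;
    } wb;
};

_Static_assert(sizeof(union rx_desc) == 16, "descriptor must be 16 bytes");

/* Driver side: post a buffer. Zeroing hdr_addr also wipes status and length. */
static void rx_desc_post(union rx_desc *d, uint64_t buf_phys) {
    d->read.pkt_addr = buf_phys;
    d->read.hdr_addr = 0;
}

/* Driver side: has the NIC written this descriptor back yet? */
static int rx_desc_done(const union rx_desc *d) {
    return (d->wb.upper.status_error & RX_STAT_DD) != 0;
}
```

Because hdr_addr covers bytes 8 to 15, the same bytes as wb.upper, a single 64-bit store of zero both resets DD and clears the old length, which is why the RX loop needs no separate "clear the flag" step.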

4 · The ring: “we own the tail, the hardware owns the head”

That one comment from ixy's RX function is the entire mental model. The descriptor ring is a circular array; the driver advances the tail register (RDT for RX, TDT for TX) to publish work, and the NIC advances the head (RDH/TDH) as it consumes. Writing the tail register is the doorbell. It's the same producer/consumer split as the SPSC ring: one side owns each index.
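The index arithmetic behind such a ring fits in three helpers. This is a sketch under the assumption that the ring size is a power of two (so wrapping is a bit mask); ring_next, ring_in_flight and ring_full are hypothetical names, and the "one slot stays empty" rule is the convention used to tell a full ring from an empty one.

```c
#include <stdint.h>

/* Advance an index by one slot, wrapping at ring_size (must be a power of two). */
static inline uint16_t ring_next(uint16_t index, uint16_t ring_size) {
    return (uint16_t)((index + 1) & (ring_size - 1));
}

/* Slots posted by the producer (tail) that the consumer side has not reclaimed yet. */
static inline uint16_t ring_in_flight(uint16_t tail, uint16_t clean, uint16_t ring_size) {
    return (uint16_t)((tail - clean) & (ring_size - 1));
}

/* Full means: advancing the tail once more would collide with the clean index. */
static inline int ring_full(uint16_t tail, uint16_t clean, uint16_t ring_size) {
    return ring_next(tail, ring_size) == clean;
}
```

With tail == clean meaning "empty", a ring of N slots can hold at most N - 1 entries; giving up one slot is cheaper than keeping a separate counter that both sides would have to update.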

4.5 · Bringing the queues up

Before any of that runs, the device has to be reset and configured. reset_and_init() follows a strict order: disable interrupts → global reset → wait → init link → set up the RX and TX rings → enable the queues → wait for link. You always reset and configure before you enable, and you wait for the link to come up last.

reset_and_init sequence: disable interrupts, global reset, EEPROM, init_link, init_rx, init_tx, start queues, wait for link
reset_and_init(): reset, then configure, then enable; waiting for the link comes last.

Setting up a queue means handing the NIC the ring's physical base address and length (RDBAL/RDBAH/RDLEN, TDBAL/TDBAH/TDLEN), choosing buffer size and DROP_EN on RX, and the prefetch/write-back thresholds on TX, then flipping the enable bit and polling until the hardware acknowledges.

RX queue setup registers: SRRCTL with DROP_EN, RDBAL/RDBAH, RDLEN, RDH/RDT, enable RXEN
init_rx: program the ring base/length and SRRCTL (with DROP_EN), then fill buffers and set RDT to the last slot.
TX queue setup: TXDCTL thresholds, TDBAL/TDLEN, DMATXCTL.TE enable
init_tx: program the ring, set the descriptor thresholds, enable TX DMA, then enable the queue and wait.
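The "program base and length, flip enable, poll" pattern can be sketched against a fake register file. Everything named here is hypothetical: the REG_* offsets, the CTL_ENABLE bit and queue_setup are made up for illustration and are not the ixgbe register map. The fake registers are plain memory, so the enable bit reads back immediately; real hardware may take a while, which is why the poll is bounded.

```c
#include <stdint.h>

/* Hypothetical register offsets (bytes) and enable bit; not real ixgbe values. */
enum { REG_BAL = 0x00, REG_BAH = 0x04, REG_LEN = 0x08, REG_CTL = 0x0C };
#define CTL_ENABLE (1u << 25)
#define DESC_SIZE  16u          /* bytes per descriptor */

static void reg_write(volatile uint32_t *regs, uint32_t off, uint32_t v) {
    regs[off / 4] = v;
}

static uint32_t reg_read(volatile uint32_t *regs, uint32_t off) {
    return regs[off / 4];
}

/* Returns 0 once the queue reports enabled, -1 if it never does. */
static int queue_setup(volatile uint32_t *regs, uint64_t ring_phys, uint32_t num_entries) {
    /* A 64-bit physical base goes into two 32-bit registers: low half, high half. */
    reg_write(regs, REG_BAL, (uint32_t)(ring_phys & 0xFFFFFFFFu));
    reg_write(regs, REG_BAH, (uint32_t)(ring_phys >> 32));
    /* Ring length is given in bytes, not in descriptors. */
    reg_write(regs, REG_LEN, num_entries * DESC_SIZE);
    /* Enable last, after everything else is configured. */
    reg_write(regs, REG_CTL, reg_read(regs, REG_CTL) | CTL_ENABLE);
    for (int tries = 0; tries < 1000; tries++) {
        if (reg_read(regs, REG_CTL) & CTL_ENABLE) {
            return 0;
        }
    }
    return -1;
}
```

The ordering is the point: the device must never see the enable bit while the base address or length still holds stale values.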

5 · The RX path, for real

This is what “trace one RX path end to end” actually looks like. Read each slot; if the NIC set DD, take the packet, refill the slot with a fresh buffer, hand the slot back, and move on; when you hit a slot without DD, stop. Then ring the doorbell once.

Animated RX descriptor ring: NIC head fills slots and sets DD; driver tail consumes and refills, then rings RDT
The NIC advances the head (RDH) setting DD; the driver polls, consumes, refills, and advances the tail (RDT).
File: ixgbe.c (ixgbe_rx_batch)
// tl;dr: we control the TAIL of the queue, the hardware the HEAD
uint32_t ixgbe_rx_batch(/* ... */ struct pkt_buf *bufs[], uint32_t num_bufs) {
    uint16_t rx_index = queue->rx_index;
    uint16_t last_rx_index = rx_index;              // last slot we actually processed
    uint32_t buf_index;
    for (buf_index = 0; buf_index < num_bufs; buf_index++) {
        volatile union ixgbe_adv_rx_desc *desc = queue->descriptors + rx_index;
        uint32_t status = desc->wb.upper.status_error;

        if (status & IXGBE_RXDADV_STAT_DD) {        // DD = NIC has filled this slot
            struct pkt_buf *buf = queue->virtual_addresses[rx_index];
            buf->size = desc->wb.upper.length;

            struct pkt_buf *new_buf = pkt_buf_alloc(queue->mempool);   // refill
            desc->read.pkt_addr = new_buf->buf_addr_phy + offsetof(struct pkt_buf, data);
            desc->read.hdr_addr = 0;                // same bytes as the status: clears DD
            queue->virtual_addresses[rx_index] = new_buf;

            bufs[buf_index] = buf;                  // give the packet to the caller
            last_rx_index = rx_index;               // remember it for the doorbell
            rx_index = wrap_ring(rx_index, queue->num_entries);
        } else {
            break;                                  // DD not set -> no more packets
        }
    }
    if (rx_index != last_rx_index) {                // only if we took at least one packet
        // ring the doorbell: publish how far we've refilled (the TAIL), one slot behind
        set_reg32(dev->addr, IXGBE_RDT(queue_id), last_rx_index);
        queue->rx_index = rx_index;
    }
    return buf_index;                               // number of packets handed to the caller
}
  • Polling, by default. The hot path just reads the DD bit in a loop, with no interrupt. (ixy can optionally wait on a VFIO interrupt first, but the datapath is a poll.) That's the kernel-bypass model: trade a burned core for latency, exactly like busy-poll vs. interrupt.
  • Refill is mandatory. Every packet you take out, you put a fresh buffer back in (desc->read.pkt_addr = new_buf->buf_addr_phy), or the ring runs dry and the NIC drops.
  • The doorbell is one MMIO write (set_reg32(... RDT ...)), done after the batch and intentionally one slot behind, so the driver never signals RDT == RDH (which would mean “full”).
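The three points above can be watched in a hardware-free simulation. This sketch models the ring as an array of flags and the tail register as a plain field; struct sim_rxq, sim_nic_fill and sim_rx_batch are hypothetical names, packet data is reduced to a length, and "refill" is reduced to clearing the slot.

```c
#include <stdint.h>

#define RING_SIZE 8u   /* power of two, so wrapping is a mask */

struct sim_desc {
    uint32_t dd;       /* 1 once the fake NIC has filled the slot */
    uint16_t length;
};

struct sim_rxq {
    struct sim_desc ring[RING_SIZE];
    uint16_t rx_index;    /* next slot the driver will inspect */
    uint16_t rdt;         /* last value written to the simulated tail register */
    uint32_t doorbells;   /* how many tail writes happened */
};

/* Fake NIC: mark one slot as filled with a packet of `len` bytes. */
static void sim_nic_fill(struct sim_rxq *q, uint16_t slot, uint16_t len) {
    q->ring[slot].dd = 1;
    q->ring[slot].length = len;
}

/* Driver: poll DD, take packets, refill, then ring the doorbell at most once. */
static uint32_t sim_rx_batch(struct sim_rxq *q, uint16_t lengths[], uint32_t max) {
    uint16_t rx_index = q->rx_index;
    uint16_t last = rx_index;
    uint32_t n;
    for (n = 0; n < max; n++) {
        struct sim_desc *d = &q->ring[rx_index];
        if (!d->dd) {
            break;                     /* no more packets */
        }
        lengths[n] = d->length;        /* hand the packet to the caller */
        d->dd = 0;                     /* "refill": slot is ready for the NIC again */
        d->length = 0;
        last = rx_index;
        rx_index = (uint16_t)((rx_index + 1) & (RING_SIZE - 1));
    }
    if (n > 0) {
        q->rdt = last;                 /* one slot behind rx_index, on purpose */
        q->doorbells++;
        q->rx_index = rx_index;
    }
    return n;
}
```

After the fake NIC fills slots 0 to 2, one call returns three packets, leaves rx_index at 3, writes 2 to the tail, and counts exactly one doorbell; a call on an idle ring writes nothing at all.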

6 · The TX path: clean, then send, then one doorbell

TX is two halves: reclaim descriptors the NIC has finished (DD set) so you can free those buffers, then post new ones. The RS flag (“report status”) is how you ask the NIC to set DD when it's done. The ring is “full” when posting one more would catch the cleanup index, so ixy always leaves one slot empty: the classic trick to tell full from empty on a head/tail ring.

Animated TX ring: reclaim DD-set descriptors, post new ones, then a single TDT doorbell write per batch
Reclaim finished descriptors, post the batch, then ring TDT exactly once, never per packet.
File: ixgbe.c (ixgbe_tx_batch)
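uint32_t ixgbe_tx_batch(/* ... */ struct pkt_buf *bufs[], uint32_t num_bufs) {
    uint16_t clean_index = queue->clean_index;      // next descriptor to reclaim
    // step 1: reclaim descriptors the NIC has FINISHED (DD set), a whole batch at a time
    while (true) {
        int32_t cleanable = queue->tx_index - clean_index;
        if (cleanable < 0) cleanable += queue->num_entries;     // handle wrap-around
        if (cleanable < TX_CLEAN_BATCH) break;
        // check only the LAST descriptor of the batch: if it is done, so are the ones before it
        int32_t cleanup_to = (clean_index + TX_CLEAN_BATCH - 1) % queue->num_entries;
        if (!(queue->descriptors[cleanup_to].wb.status & IXGBE_ADVTXD_STAT_DD))
            break;                                  // not done yet, stop
        /* ... pkt_buf_free() every buf from clean_index through cleanup_to ... */
        clean_index = wrap_ring(cleanup_to, queue->num_entries);
    }
    queue->clean_index = clean_index;
    // step 2: post new packets
    uint32_t sent;
    for (sent = 0; sent < num_bufs; sent++) {
        uint32_t next_index = wrap_ring(queue->tx_index, queue->num_entries);
        if (clean_index == next_index) break;       // ring full (one slot always stays empty)
        struct pkt_buf *buf = bufs[sent];
        queue->virtual_addresses[queue->tx_index] = buf;        // remembered for step 1
        volatile union ixgbe_adv_tx_desc *txd = queue->descriptors + queue->tx_index;
        queue->tx_index = next_index;
        txd->read.buffer_addr = buf->buf_addr_phy + offsetof(struct pkt_buf, data);
        txd->read.cmd_type_len = IXGBE_ADVTXD_DCMD_EOP   // last descriptor of the packet
                               | IXGBE_ADVTXD_DCMD_RS    // ask the NIC to report status (DD)
                               | /* ... */ buf->size;
    }
    // ONE doorbell write for the whole batch, never per packet
    set_reg32(dev->addr, IXGBE_TDT(queue_id), queue->tx_index);
    return sent;                                    // packets actually queued
}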
Why batching is the whole game. ixy's own comment: “huge performance gains possible here by sending packets in batches - writing to TDT for every packet is not efficient.” One doorbell (an expensive MMIO write) is amortised over many packets. This is the throughput-vs-latency, amortise-the-fixed-cost instinct the drills stress.
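The reclaim half applies the same batching idea to cleanup: instead of testing every descriptor, test only the last one of each batch. Here is a simulation of that step alone, with a byte array standing in for the DD bits. struct sim_txq, sim_tx_clean, TXQ_SIZE and CLEAN_BATCH are hypothetical names and sizes; freeing the buffer is reduced to clearing the flag.

```c
#include <stdint.h>

#define TXQ_SIZE    16u   /* power of two */
#define CLEAN_BATCH 4u    /* reclaim this many descriptors at a time */

struct sim_txq {
    uint8_t  dd[TXQ_SIZE];   /* 1 = the fake NIC reported this descriptor done */
    uint16_t tx_index;       /* next slot to post into (the tail) */
    uint16_t clean_index;    /* next slot to reclaim */
};

/* Reclaim whole batches only; returns how many descriptors were reclaimed. */
static uint32_t sim_tx_clean(struct sim_txq *q) {
    uint32_t reclaimed = 0;
    for (;;) {
        /* posted but not yet reclaimed, with wrap-around handled by the mask */
        uint16_t cleanable = (uint16_t)((q->tx_index - q->clean_index) & (TXQ_SIZE - 1));
        if (cleanable < CLEAN_BATCH) {
            break;
        }
        /* descriptors complete in order, so checking the last one covers the batch */
        uint16_t last = (uint16_t)((q->clean_index + CLEAN_BATCH - 1) & (TXQ_SIZE - 1));
        if (!q->dd[last]) {
            break;   /* batch not finished: reclaim all of it or none of it */
        }
        for (uint32_t i = 0; i < CLEAN_BATCH; i++) {
            q->dd[(q->clean_index + i) & (TXQ_SIZE - 1)] = 0;   /* buffer freed here */
        }
        q->clean_index = (uint16_t)((last + 1) & (TXQ_SIZE - 1));
        reclaimed += CLEAN_BATCH;
    }
    return reclaimed;
}
```

The cost of this shortcut is that up to CLEAN_BATCH - 1 finished buffers can sit unreclaimed while traffic is idle; for a forwarding benchmark that trade is usually acceptable.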

7 · The application: ixy-fwd's poll loop

All of the above exists to serve a tiny loop. forward() rx-batches up to 32 packets, touches each one (data[1]++, because a workload in which packet data just sits in L3 and the core never reads it would be unrealistically cheap), tx-batches them to the other port, and drops whatever the TX ring couldn't take, because waiting would just accumulate latency. main() spins that in both directions forever in a single thread (you'd pin it to an isolated core in production); the only off-datapath work is a throttled once-a-second stats print.

Animated forwarding loop: packets flow from port 1 through rx_batch, touch, tx_batch to port 2 and back
The whole workload: rx_batch → touch → tx_batch → drop unsent, looped over both ports on a single spinning core.
File: app/ixy-fwd.c (forward + main)
// app/ixy-fwd.c: the entire forwarding workload is a poll loop
static void forward(struct ixy_device* rx_dev, uint16_t rx_q,
                    struct ixy_device* tx_dev, uint16_t tx_q) {
    struct pkt_buf* bufs[BATCH_SIZE];                       // BATCH_SIZE = 32
    uint32_t num_rx = ixy_rx_batch(rx_dev, rx_q, bufs, BATCH_SIZE);
    if (num_rx > 0) {
        for (uint32_t i = 0; i < num_rx; i++) bufs[i]->data[1]++;  // touch each packet
        uint32_t num_tx = ixy_tx_batch(tx_dev, tx_q, bufs, num_rx);
        for (uint32_t i = num_tx; i < num_rx; i++) pkt_buf_free(bufs[i]);  // drop unsent
    }
}

int main(/* ... */) {
    struct ixy_device* dev1 = ixy_init(argv[1], 1, 1, -1);
    struct ixy_device* dev2 = ixy_init(argv[2], 1, 1, 0);
    while (true) {                  // one core, spinning forever, no syscalls
        forward(dev1, 0, dev2, 0);
        forward(dev2, 0, dev1, 0);
    }
}
Drop, don't block. Choosing to free unsent packets rather than busy-wait on TX is a real latency decision: the same “shed load to protect tail latency” instinct that shows up in incast and congestion.

8 · The memory-ordering gem

This is the single best thing to bring up: ixy's actual comment on the line that rings the TX doorbell.

File: ixgbe.c (ixgbe_tx_batch)
// send out by advancing tail, i.e., pass control of the bufs to the nic
// this seems like a textbook case for a release memory order,
// but Intel's driver doesn't even use a compiler barrier here
set_reg32(dev->addr, IXGBE_TDT(queue_id), queue->tx_index);
Why it's safe here but not everywhere. The descriptor stores must be visible to the NIC before the doorbell store: a classic release publish. On x86 (TSO), stores are not reordered with other stores, so no CPU barrier is needed and Intel omits it; the compiler still must not reorder them, which ixy's set_reg32 ensures with its compiler barrier. On a weakly-ordered CPU (Arm, POWER, RISC-V) the missing barrier would be a real bug: you'd need a write barrier such as the kernel's wmb() before the doorbell. Naming this distinction unprompted is exactly the senior signal from the doorbell drill.
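As a portable illustration of where the barrier goes, here is a sketch that writes a descriptor, issues a C11 release fence, and only then stores to the doorbell. struct tx_slot and publish_and_ring are hypothetical names. It is an assumption, not a guarantee, that a C11 fence is strong enough for device memory on a given architecture: kernels and DPDK use architecture-specific barriers for the MMIO case. On x86 the fence below compiles to a compiler barrier only.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Cut-down TX descriptor: just the two fields the sketch needs. */
struct tx_slot {
    uint64_t buffer_addr;
    uint32_t cmd_type_len;
};

/*
 * Publish one descriptor, then ring the doorbell.
 * The fence sits between the descriptor stores and the doorbell store so that
 * neither the compiler nor a weakly-ordered CPU moves the doorbell ahead.
 */
static void publish_and_ring(struct tx_slot *slot, uint64_t phys, uint32_t cmd,
                             volatile uint32_t *doorbell, uint32_t new_tail) {
    slot->buffer_addr = phys;          /* step 1: fill in the descriptor */
    slot->cmd_type_len = cmd;
    atomic_thread_fence(memory_order_release);   /* step 2: order the stores */
    *doorbell = new_tail;              /* step 3: hand ownership to the device */
}
```

The design point is the position of the fence, not its exact spelling: everything the device will read must be written before the fence, and the store that tells the device to look must come after it.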

How to use this in the interview

You don't have to claim you've shipped a driver. Say: “I worked through ixy, a real userspace ixgbe driver, to make the datapath concrete: the RX poll loop on the DD bit, refilling and handing slots back via the tail register, batched TX with a single doorbell, and the physical-address/pinned-hugepage DMA setup. It made the descriptor-ownership and barrier-before-doorbell ideas click, because I'd debugged the same class of ownership bug across the MCU-DSP boundary at MediaTek.” That's honest, specific, and earns the vocabulary.

Source: github.com/emmericp/ixy and the paper “User Space Network Drivers” (Emmerich et al.). Excerpts are lightly trimmed for readability.