Why userspace I/O exists.
Before any of the SPDK-specific stuff makes sense, you have to understand what problem userspace I/O is solving in the first place. This is a short primer. It builds the vocabulary every later layer assumes.
- The traditional path: how a user app reads a file today
- Why that path is slow at high IOPS
- What "kernel bypass" actually means
- The trade-offs you make by going to userspace
- What SPDK picks and why
The traditional path: how a user app reads a file today
When a normal application — say, a database — wants to read a block of data from a disk, the request goes through a surprisingly long chain:
read() → syscall into the kernel → VFS + page cache → filesystem driver → block layer → device driver → NVMe device (via DMA)
Each of those steps is fast. But at millions of I/O operations per second, even a small per-step cost compounds. A modern NVMe SSD can do over a million 4 KB reads per second. The kernel path, even with all its optimizations, struggles to keep up.
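From the application's side, that whole chain hides behind one call. The following is a minimal sketch of a kernel-mediated 4 KB block read, assuming a POSIX system; `read_block` and `BLOCK_SIZE` are illustrative names, not part of any library.

```c
#define _POSIX_C_SOURCE 200809L  /* for pread() under strict C standards */
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK_SIZE 4096  /* illustrative block size, matching the 4 KB reads above */

/* Read one block at block index `lba` from `path` through the kernel.
 * The single pread() call is one full trip through the kernel I/O path.
 * Returns the number of bytes read, or -1 on error. */
ssize_t read_block(const char *path, uint64_t lba, void *buf)
{
    assert(path != NULL && buf != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* The kernel copies the data into `buf` before returning. */
    ssize_t n = pread(fd, buf, BLOCK_SIZE, (off_t)(lba * BLOCK_SIZE));
    close(fd);
    return n;
}
```

A caller would pass a file or block device path and a buffer of at least `BLOCK_SIZE` bytes. Opening the file on every call keeps the sketch short; a real application would hold the descriptor open.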
Why that path is slow at high IOPS
There are roughly three classes of overhead in the kernel I/O path:
| Class of overhead | What it costs | Why it's there |
|---|---|---|
| Syscall + context switch | ~1–2 µs per call (a transition into the kernel and back) | Hardware protection: user code can't touch kernel memory directly |
| Interrupt + scheduling | Tens of µs of tail latency, especially under load | The device tells the kernel "I'm done" via an interrupt; the kernel has to schedule a handler |
| Data copies | Memory bandwidth, cache pollution | User buffer and kernel buffer are usually separate; data is copied between them |
At a million IOPS with a 4KB payload, you're pushing 4 GB/s of useful data. The syscall and copy overhead, even if "small," is now a significant fraction of CPU time. You're spending more cycles arranging the data movement than actually moving it.
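The arithmetic behind that claim is easy to check. The sketch below takes the ~1–2 µs per-call figure from the table as an assumption, not a measurement; the function names are illustrative.

```c
#include <assert.h>

/* Useful payload bandwidth in bytes per second. */
double payload_bytes_per_sec(double iops, double io_size_bytes)
{
    assert(iops >= 0.0 && io_size_bytes >= 0.0);
    return iops * io_size_bytes;
}

/* CPU cores' worth of time consumed by a fixed per-I/O overhead.
 * A result of 1.0 means one full core does nothing but pay the overhead. */
double cores_spent_on_overhead(double iops, double overhead_ns_per_io)
{
    assert(iops >= 0.0 && overhead_ns_per_io >= 0.0);
    return iops * overhead_ns_per_io / 1e9;
}
```

With 1,000,000 IOPS and 4096-byte I/Os, `payload_bytes_per_sec` gives 4.096e9 bytes per second. At an assumed 1 µs of overhead per I/O, `cores_spent_on_overhead` gives 1.0, and at 2 µs it gives 2.0. Under those assumptions, one to two full cores are spent entering and leaving the kernel before any data moves.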
What "kernel bypass" actually means
flowchart LR
  subgraph Traditional["Traditional I/O path"]
    A1[App] -->|syscall| A2[Kernel]
    A2 --> A3[VFS + page cache]
    A3 --> A4[FS driver]
    A4 --> A5[Block layer]
    A5 --> A6[Device driver]
    A6 -->|DMA| A7[NVMe device]
  end
  subgraph Bypass["Kernel bypass (SPDK-style)"]
    B1[App] --> B2[SPDK in userspace]
    B2 -->|mmio / doorbell| B3[NVMe device]
  end
  A1 -->|more steps = more latency| Traditional
  B1 -->|fewer steps = less latency| Bypass
Fig. 1. The same I/O, two ways. The right side skips the kernel entirely: the app talks to the device's memory-mapped registers directly.
"Kernel bypass" is a marketing term. What it actually means is:
- Map the device's registers into your process's address space. Modern PCIe devices expose control registers (doorbells, queues) as memory-mapped I/O. If you can mmap() them, you don't need the kernel to write to them for you.
- Allocate DMA-capable memory. The device reads/writes RAM directly. If your buffer is DMA-capable, the device can touch it without kernel help.
- Poll the device for completion. No more waiting for an interrupt. You'll see in Layer 0.3 why polling is faster at high IOPS.
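The mapping and polling steps can be sketched without real hardware. In the code below an anonymous memory mapping stands in for a device's register region. `struct fake_regs`, its fields, and every function name are hypothetical and do not reflect NVMe's or SPDK's actual layout or API. The DMA-memory step is left out because it needs pinned memory that a portable sketch cannot set up.

```c
#define _GNU_SOURCE  /* for MAP_ANONYMOUS under strict C standards */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Hypothetical register layout; a real device defines its own. */
struct fake_regs {
    volatile uint32_t doorbell;  /* the app writes here to submit work */
    volatile uint32_t done;      /* the "device" sets this on completion */
};

/* Stand-in for mapping a device's registers: anonymous memory is mapped
 * so the sketch runs anywhere. Returns NULL on failure. */
struct fake_regs *map_fake_regs(void)
{
    void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : (struct fake_regs *)p;
}

/* Submission is a plain store to mapped memory: no syscall involved. */
void ring_doorbell(struct fake_regs *r, uint32_t tail)
{
    assert(r != NULL);
    r->doorbell = tail;
}

/* Poll for completion instead of sleeping until an interrupt arrives.
 * Returns 1 if the completion was seen within max_spins checks, else 0. */
int poll_done(struct fake_regs *r, unsigned max_spins)
{
    assert(r != NULL);
    for (unsigned i = 0; i < max_spins; i++) {
        if (r->done)
            return 1;
    }
    return 0;
}

/* Simulates the device finishing the request. */
void fake_device_complete(struct fake_regs *r)
{
    assert(r != NULL);
    r->done = 1;
}
```

The `volatile` qualifiers force the compiler to perform every register access. Without them, the polling loop could be optimized into a single read.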
The trade-offs you make by going to userspace
Kernel bypass isn't free. Here's the bill:
What you give up:
- Kernel safety nets. A bad pointer handed to a syscall comes back as an error code (EFAULT); in a userspace driver the same bug crashes the whole process, and with it every I/O in flight. SPDK is full of assert() calls because of this.
- Filesystems, permissions, namespaces. All gone. You're talking to raw block devices.
- Page cache. You manage your own caching and data placement now.
- Portability. Your code now depends on specific PCIe devices, NICs, hugepage setup, etc.
What you get in return:
- No syscalls in the hot path. Submitting I/O is just a write to a memory location.
- No copies. Your buffer is the DMA target.
- No interrupt latency. You poll when you want, on your own thread.
- Predictable latency. No scheduling surprises.
What SPDK picks and why
SPDK's tagline used to be "putting the 'D' back in 'DAX'." The point: for storage workloads, the kernel is more obstacle than enabler. SPDK bets the trade is worth it for the users who care.
Specifically, SPDK commits to:
- Poll-mode I/O (see 0.3 for why)
- Hugepages for DMA memory (see 0.2 for the hardware context)
- Pinning threads to cores (one reactor per core, see 2.1)
- Cooperative scheduling, not preemptive (no kernel scheduler, your code runs to completion)
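To make "cooperative, run to completion" concrete, here is a minimal sketch of a poll loop in that style. It is not SPDK's reactor implementation: `struct reactor`, `reactor_register`, `reactor_run_once`, and `countdown_poller` are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_POLLERS 8  /* arbitrary capacity for the sketch */

/* A poller does a small amount of work and returns without blocking.
 * It returns the number of events it handled on this call. */
typedef int (*poller_fn)(void *ctx);

struct reactor {
    poller_fn fns[MAX_POLLERS];
    void *ctxs[MAX_POLLERS];
    size_t count;
};

/* Register a poller with its context. Returns 0 on success, -1 if full. */
int reactor_register(struct reactor *r, poller_fn fn, void *ctx)
{
    assert(r != NULL && fn != NULL);
    if (r->count == MAX_POLLERS)
        return -1;
    r->fns[r->count] = fn;
    r->ctxs[r->count] = ctx;
    r->count++;
    return 0;
}

/* One pass over all pollers. Each runs to completion before the next
 * starts; nothing preempts it. Returns the total events handled. */
int reactor_run_once(struct reactor *r)
{
    assert(r != NULL);
    int handled = 0;
    for (size_t i = 0; i < r->count; i++)
        handled += r->fns[i](r->ctxs[i]);
    return handled;
}

/* Example poller: consumes one unit of pending work per call. */
int countdown_poller(void *ctx)
{
    int *remaining = ctx;
    if (*remaining == 0)
        return 0;
    (*remaining)--;
    return 1;
}
```

In a design like this, one thread pinned to a core would call `reactor_run_once` in a tight loop. Because no poller is ever preempted, a poller that blocks stalls every other poller on that core, which is why each one must stay short.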
What to take away
The reason SPDK exists is that the kernel's I/O path, however well-designed, imposes overheads that matter when you're pushing a million IOPS. The solution is to take the storage stack out of the kernel entirely and run it as a normal userspace process, accepting the safety, portability, and ergonomics trade-offs described above.
The rest of the curriculum assumes you know this. The next primer page — NVMe at the hardware level — digs into what "talk to the device directly" actually means in NVMe terms.