Layer 0 · Prerequisite primer

Why userspace I/O exists.

Before any of the SPDK-specific stuff makes sense, you have to understand what problem userspace I/O is solving in the first place. This is a short primer. It builds the vocabulary every later layer assumes.

~8 min read · 1 diagram · prerequisite: none
On this page
  1. The traditional path: how a user app reads a file today
  2. Why that path is slow at high IOPS
  3. What "kernel bypass" actually means
  4. The trade-offs you make by going to userspace
  5. What SPDK picks and why

The traditional path: how a user app reads a file today

When a normal application — say, a database — wants to read a block of data from a disk, the request goes through a surprisingly long chain:

  1. App calls read(). User code, in libc.
  2. Syscall trap. CPU switches to kernel mode.
  3. VFS layer. The virtual filesystem.
  4. Page cache lookup. Already in RAM?
  5. Filesystem driver. ext4, xfs, etc.
  6. Block layer + scheduler. Merges and sorts I/O.
  7. Device driver. NVMe, SCSI, etc.
  8. DMA by the device. Data lands in RAM.
  9. Return to userspace. Syscall return.

Each of those steps is fast. But at millions of I/O operations per second, even a small per-step cost compounds. A modern NVMe SSD can do over a million 4 KB reads per second. The traditional path, with one syscall and one interrupt per I/O, struggles to keep up even with all the kernel's optimizations.
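Seen from the application, that whole chain hides behind a single call. The following is a minimal sketch, assuming a POSIX system (os.pread is not available on Windows). The helper names read_blocks and make_scratch_file are made up for illustration.

```python
import os
import tempfile

BLOCK_SIZE = 4096  # 4 KB, the read size used throughout this page


def read_blocks(path, block_count):
    """Read block_count blocks the traditional way: one pread() syscall each."""
    fd = os.open(path, os.O_RDONLY)
    try:
        blocks = []
        for i in range(block_count):
            # Each call traps into the kernel and walks the layers listed
            # above before control returns to this loop.
            blocks.append(os.pread(fd, BLOCK_SIZE, i * BLOCK_SIZE))
        return blocks
    finally:
        os.close(fd)


def make_scratch_file(block_count):
    """Create a throwaway file whose i-th block is filled with byte value i."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        for i in range(block_count):
            f.write(bytes([i % 256]) * BLOCK_SIZE)
        return f.name


scratch = make_scratch_file(8)
blocks = read_blocks(scratch, 8)
os.remove(scratch)
print(len(blocks), "blocks read with", len(blocks), "pread() syscalls")
```

A freshly written scratch file like this one is almost certainly served from the page cache, so the reads stop at step 4 and never reach the device. The syscall round trip is paid on every call regardless.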

Why that path is slow at high IOPS

There are roughly three classes of overhead in the kernel I/O path:

Class of overhead | What it costs | Why it's there
-----------------|---------------|---------------
Syscall + context switch | ~1–2 µs per call (a mode switch into the kernel and back, plus a context switch if the call blocks) | Hardware protection: user code can't touch kernel memory directly
Interrupt + scheduling | 10s of µs tail latency, especially under load | The device tells the kernel "I'm done" via an interrupt; the kernel has to schedule a handler
Data copies | Memory bandwidth, cache pollution | User buffer and kernel buffer are usually separate; data is copied between them

At a million IOPS with a 4 KB payload, you're pushing about 4 GB/s of useful data. The syscall and copy overhead, even if "small" per call, is now a significant fraction of CPU time. You can end up spending more cycles arranging the data movement than actually moving it.
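The arithmetic behind that claim is short enough to write out. This is a back-of-the-envelope sketch, not a measurement: the 1.5 µs syscall cost is simply the midpoint of the range in the table, and the 3 GHz core is an assumed figure.

```python
IOPS = 1_000_000          # target rate from the text
BLOCK_BYTES = 4096        # 4 KB payload
SYSCALL_US = 1.5          # assumption: midpoint of the ~1-2 µs range above
CPU_HZ = 3_000_000_000    # assumption: one 3 GHz core

# Useful data moved per second.
throughput_gb_s = IOPS * BLOCK_BYTES / 1e9

# Time and cycles one core may spend on each I/O to sustain the rate alone.
budget_us = 1e6 / IOPS
budget_cycles = CPU_HZ / IOPS

# Cores' worth of CPU time consumed by syscall overhead alone.
syscall_cores = IOPS * SYSCALL_US / 1e6

print(f"throughput:      {throughput_gb_s:.3f} GB/s")
print(f"per-I/O budget:  {budget_us:.1f} µs ({budget_cycles:.0f} cycles)")
print(f"syscall cost:    {syscall_cores:.1f} cores")
```

Under these assumptions a single core has 1 µs (3000 cycles) per I/O, while the syscall overhead alone costs 1.5 µs. One core cannot sustain the rate before it has done any useful work.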

What "kernel bypass" actually means

flowchart LR
subgraph Traditional["Traditional I/O path: more steps = more latency"]
  A1[App] -->|syscall| A2[Kernel]
  A2 --> A3[VFS + page cache]
  A3 --> A4[FS driver]
  A4 --> A5[Block layer]
  A5 --> A6[Device driver]
  A6 -->|DMA| A7[NVMe device]
end

subgraph Bypass["Kernel bypass (SPDK-style): fewer steps = less latency"]
  B1[App] --> B2[SPDK in userspace]
  B2 -->|mmio / doorbell| B3[NVMe device]
end

fig. 1   The same I/O, two ways. The right side skips the kernel entirely — the app talks to the device's memory-mapped registers directly.

"Kernel bypass" is a marketing term. What it actually means is:

  1. Map the device's registers into your process's address space. Modern PCIe devices expose control registers (doorbells, queues) as memory-mapped I/O. If you can mmap() them, you don't need the kernel to write to them for you.

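  2. Allocate DMA-capable memory. The device reads and writes RAM directly, so the buffer has to be pinned (never swapped out or moved) and sit at an address the device can use. Once that holds, the device can touch the buffer without kernel help.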

  3. Poll the device for completion. No more waiting for an interrupt. You'll see in Layer 0.3 why polling is faster at high IOPS.
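The three steps above can be sketched as a toy model that runs without any hardware. Everything here is an assumption for illustration: anonymous mmap regions stand in for the mapped registers and the DMA buffer, the register offsets SQ_DOORBELL and CQ_COUNT are invented, and toy_device_step plays the part of the device. None of this is the NVMe register layout or SPDK's API.

```python
import mmap
import struct

QUEUE_DEPTH = 4
BLOCK_SIZE = 4096
SQ_DOORBELL = 0   # hypothetical register: commands submitted so far
CQ_COUNT = 4      # hypothetical register: commands completed so far

# Step 1 stand-in: a mapping playing the role of mmap()ed device registers.
registers = mmap.mmap(-1, mmap.PAGESIZE)
# Step 2 stand-in: a mapping playing the role of pinned DMA memory.
dma_buffer = mmap.mmap(-1, QUEUE_DEPTH * BLOCK_SIZE)
submission_queue = [None] * QUEUE_DEPTH  # block number requested per slot


def read_reg(offset):
    return struct.unpack_from("<I", registers, offset)[0]


def write_reg(offset, value):
    struct.pack_into("<I", registers, offset, value)


def toy_device_step():
    """Pretend to be the hardware: complete one pending command, if any."""
    submitted, completed = read_reg(SQ_DOORBELL), read_reg(CQ_COUNT)
    if completed == submitted:
        return
    slot = completed % QUEUE_DEPTH
    start = slot * BLOCK_SIZE
    fill = bytes([submission_queue[slot] % 256]) * BLOCK_SIZE
    dma_buffer[start:start + BLOCK_SIZE] = fill  # the "DMA" into our buffer
    write_reg(CQ_COUNT, completed + 1)


def submit_read(block):
    """Queue a read and ring the doorbell: a plain memory write, no syscall."""
    submitted = read_reg(SQ_DOORBELL)
    slot = submitted % QUEUE_DEPTH
    submission_queue[slot] = block
    write_reg(SQ_DOORBELL, submitted + 1)
    return slot


def poll_until_complete(target_count):
    """Step 3: spin on the completion counter instead of taking an interrupt."""
    polls = 0
    while read_reg(CQ_COUNT) < target_count:
        toy_device_step()  # real hardware would make progress on its own
        polls += 1
    return polls


slot = submit_read(block=7)
polls = poll_until_complete(1)
data = bytes(dma_buffer[slot * BLOCK_SIZE:(slot + 1) * BLOCK_SIZE])
print("slot", slot, "completed after", polls, "poll(s); first byte =", data[0])
```

The part that carries over to the real thing is the shape of the hot path: a store to a mapped location to submit, a load from a mapped location to detect completion, and the data already sitting in the application's own buffer.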

The trade-offs you make by going to userspace

Kernel bypass isn't free. Here's the bill:

You lose

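Kernel safety nets. The driver now lives in your process, so a bad pointer in the storage stack takes the whole application down with it, and a bad DMA address can let the device overwrite memory with no fault at all unless an IOMMU catches it. SPDK is full of assert() calls because of this.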

Filesystems, permissions, namespaces. All gone. You're talking to raw block devices.

Page cache. You manage your own caching and data placement now.

Portability. Your code now depends on specific PCIe devices, NICs, hugepage setup, etc.

You gain

No syscalls in the hot path. Just a write to a memory location.

No copies. Your buffer IS the DMA target.

No interrupt latency. You poll when you want, on your own thread.

Predictable latency. No scheduling surprises.

What SPDK picks and why

SPDK (the Storage Performance Development Kit) starts from one position: for high-IOPS storage workloads, the kernel is more obstacle than enabler. SPDK bets the trade is worth it for the users who care.

Specifically, SPDK commits to:

  • Poll-mode I/O (see 0.3 for why)
  • Hugepages for DMA memory (see 0.2 for the hardware context)
  • Pinning threads to cores (one reactor per core, see 2.1)
  • Cooperative scheduling, not preemptive (no kernel scheduler, your code runs to completion)
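The last two commitments are easier to picture with a toy event loop. This is a minimal sketch of run-to-completion scheduling under stated assumptions: the Reactor class and both pollers are hypothetical names, not SPDK's API, and core pinning is left out so the example stays portable.

```python
from collections import deque


class Reactor:
    """Hypothetical single-core loop that runs registered pollers in turn."""

    def __init__(self):
        self.pollers = []

    def register(self, poller):
        self.pollers.append(poller)

    def run_once(self):
        """Call every poller once; each runs to completion, nothing preempts it."""
        return sum(poller() for poller in self.pollers)


pending = deque(["io-0", "io-1", "io-2"])  # I/Os the toy device still owes us
completed = deque()                        # completions not yet consumed
log = []


def device_poller():
    """Toy stand-in for checking a completion queue: finish at most one I/O."""
    if not pending:
        return 0
    completed.append(pending.popleft())
    return 1


def app_poller():
    """Consume whatever completions are ready, then return promptly."""
    count = 0
    while completed:
        log.append(completed.popleft())
        count += 1
    return count


reactor = Reactor()
reactor.register(device_poller)
reactor.register(app_poller)

iterations = 0
while pending or completed:
    reactor.run_once()
    iterations += 1

print(iterations, "iterations:", log)
```

Because nothing preempts a poller, a poller that blocks or runs long stalls everything else on that core. That is the price of dropping the kernel scheduler, and it is why each poller here does a bounded amount of work and returns.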

What to take away

The reason SPDK exists is that the kernel's I/O path, however well-designed, imposes overheads that matter when you're pushing a million IOPS. The solution is to take the storage stack out of the kernel entirely and run it as a normal userspace process, giving up the kernel's safety nets, filesystems, and portability in exchange for a hot path with no syscalls, no copies, and no interrupts.

The rest of the curriculum assumes you know this. The next primer page — NVMe at the hardware level — digs into what "talk to the device directly" actually means in NVMe terms.