Layer 0 · Prerequisite primer

Why userspace I/O exists.

Before any of the SPDK-specific stuff makes sense, you have to understand what problem userspace I/O is solving in the first place. This is a short primer. It builds the vocabulary every later layer assumes.

~8 min read · 1 diagram · prerequisite: none

On this page

The traditional path: how a user app reads a file today
Why that path is slow at high IOPS
What "kernel bypass" actually means
The trade-offs you make by going to userspace
What SPDK picks and why

The traditional path: how a user app reads a file today

When a normal application — say, a database — wants to read a block of data from a disk, the request goes through a surprisingly long chain:

STEP 01 · App calls read() · User code, in libc
STEP 02 · Syscall trap · CPU switches to kernel mode
STEP 03 · VFS layer · Virtual filesystem
STEP 04 · Page cache lookup · Already in RAM?
STEP 05 · Filesystem driver · ext4, xfs, etc.
STEP 06 · Block layer + scheduler · Merges, sorts I/O
STEP 07 · Device driver · NVMe / SCSI / etc.
STEP 08 · DMA from the device · Data lands in RAM
STEP 09 · Return to userspace · Syscall return

Each of those steps is fast. But at millions of I/O operations per second, even a small per-step cost compounds. A modern NVMe SSD can do over a million 4KB reads per second. The kernel path, even with all the optimizations, struggles to keep up.
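To make the chain concrete, here is a minimal sketch of the application side of that path: one 4 KB read issued through the ordinary kernel interface. The helper names (`read_block`, `make_demo_file`) and the scratch file are assumptions made for illustration. Everything from step 02 onward happens inside the single `os.pread` call.

```python
import os
import tempfile

BLOCK_SIZE = 4096  # one 4 KB block, matching the figure used in the text


def read_block(path: str, block_index: int) -> bytes:
    """Read one block through the ordinary kernel path (one syscall per read)."""
    fd = os.open(path, os.O_RDONLY)  # open(2): the VFS resolves the path
    try:
        # pread(2): trap into the kernel, look in the page cache, and on a
        # miss let the filesystem, block layer, and device driver do the rest.
        return os.pread(fd, BLOCK_SIZE, block_index * BLOCK_SIZE)
    finally:
        os.close(fd)


def make_demo_file(num_blocks: int) -> str:
    """Create a scratch file whose block i is filled with the byte value i."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        for i in range(num_blocks):
            f.write(bytes([i]) * BLOCK_SIZE)
        return f.name


path = make_demo_file(4)
block = read_block(path, 2)
os.remove(path)
print(len(block), block[0])  # prints "4096 2"
```

The application sees one function call; the nine steps are hidden behind it, which is why their cost is easy to overlook until the call rate gets very high.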

Why that path is slow at high IOPS

There are roughly three classes of overhead in the kernel I/O path:

Class of overhead	What it costs	Why it's there
Syscall + context switch	~1–2 µs per call (a user/kernel mode transition in and out)	Hardware protection: user code can't touch kernel memory directly
Interrupt + scheduling	Tens of µs of tail latency, especially under load	The device tells the kernel "I'm done" via an interrupt; the kernel has to schedule a handler
Data copies	Memory bandwidth, cache pollution	User buffer and kernel buffer are usually separate; data is copied between them

At a million IOPS with a 4KB payload, you're pushing 4 GB/s of useful data. The syscall and copy overhead, even if "small," is now a significant fraction of CPU time. You're spending more cycles arranging the data movement than actually moving it.
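The arithmetic behind that claim can be worked through in a few lines. This is a back-of-the-envelope sketch, not a benchmark: the function name `io_budget` is made up for illustration, and the 1 µs and 2 µs overhead values are simply the ends of the per-syscall range quoted in the table above.

```python
def io_budget(iops: int, payload_bytes: int, overhead_us_per_io: float) -> dict:
    """Relate an I/O rate to throughput and to CPU time spent on overhead."""
    throughput_gb_s = iops * payload_bytes / 1e9
    # If one core handled every I/O, this is the time it has for each one.
    budget_us_per_io = 1e6 / iops
    # CPU-seconds of overhead per wall-clock second, i.e. cores kept busy.
    cores_on_overhead = iops * overhead_us_per_io / 1e6
    return {
        "throughput_gb_s": throughput_gb_s,
        "budget_us_per_io": budget_us_per_io,
        "cores_on_overhead": cores_on_overhead,
    }


low = io_budget(iops=1_000_000, payload_bytes=4096, overhead_us_per_io=1.0)
high = io_budget(iops=1_000_000, payload_bytes=4096, overhead_us_per_io=2.0)
print(f"{low['throughput_gb_s']:.3f} GB/s")            # prints "4.096 GB/s"
print(f"{low['budget_us_per_io']:.1f} us per I/O")      # prints "1.0 us per I/O"
print(low["cores_on_overhead"], high["cores_on_overhead"])  # prints "1.0 2.0"
```

Under those assumed figures, a million syscalls per second consume one to two full cores before any data has been moved or any application logic has run.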

What "kernel bypass" actually means

flowchart LR
subgraph Traditional["Traditional I/O path"]
  A1[App] -->|syscall| A2[Kernel]
  A2 --> A3[VFS + page cache]
  A3 --> A4[FS driver]
  A4 --> A5[Block layer]
  A5 --> A6[Device driver]
  A6 -->|DMA| A7[NVMe device]
end

subgraph Bypass["Kernel bypass (SPDK-style)"]
  B1[App] --> B2[SPDK in userspace]
  B2 -->|mmio / doorbell| B3[NVMe device]
end

A1 -->|more steps = more latency| Traditional
B1 -->|fewer steps = less latency| Bypass

Traditional I/O vs kernel bypass

fig. 1 The same I/O, two ways. The right side skips the kernel entirely — the app talks to the device's memory-mapped registers directly.

"Kernel bypass" is a marketing term. What it actually means is:

Map the device's registers into your process's address space. Modern PCIe devices expose control registers (doorbells, queue configuration) as memory-mapped I/O. If you can mmap() them (on Linux, typically through a driver such as VFIO or UIO), you don't need the kernel to write to them for you.
Allocate DMA-capable memory. The device reads/writes RAM directly. If your buffer is DMA-capable, meaning it is pinned so it cannot be paged out or moved and it has an address the device can use, the device can touch it without kernel help.
Poll the device for completion. No more waiting for an interrupt. You'll see in Layer 0.3 why polling is faster at high IOPS.
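The three steps above can be sketched as a pure simulation. Nothing here touches real hardware or uses SPDK: an anonymous `mmap` stands in for the mapped register region, a `bytearray` stands in for the DMA buffer, and `FakeDevice`, `DOORBELL_OFFSET`, and `submit_and_poll` are hypothetical names invented for this example. The shape of the loop is the point: write a doorbell, then poll for completion, with no syscall in between.

```python
import mmap
import struct

DOORBELL_OFFSET = 0  # hypothetical register layout, not a real NVMe BAR


class FakeDevice:
    """Stand-in for hardware: consumes submissions and posts completions."""

    def __init__(self, regs: mmap.mmap, dma_buffer: bytearray, latency_ticks: int = 3):
        self.regs = regs
        self.dma_buffer = dma_buffer
        self.latency_ticks = latency_ticks
        self.submission_queue = []  # command ids written by the host
        self.completion_queue = []  # command ids finished by the device
        self.consumed = 0           # submissions processed so far
        self.ticks_waited = 0       # simulated time spent on the current command

    def tick(self) -> None:
        """Advance simulated hardware time by one step."""
        (tail,) = struct.unpack_from("<I", self.regs, DOORBELL_OFFSET)
        if self.consumed == tail:
            return  # doorbell has not moved: nothing to do
        self.ticks_waited += 1
        if self.ticks_waited < self.latency_ticks:
            return  # command still "in flight"
        cmd_id = self.submission_queue[self.consumed]
        # "DMA": the device writes straight into the host's buffer.
        self.dma_buffer[:] = bytes([cmd_id]) * len(self.dma_buffer)
        self.completion_queue.append(cmd_id)
        self.consumed += 1
        self.ticks_waited = 0


def submit_and_poll(device: FakeDevice, regs: mmap.mmap, cmd_id: int, max_polls: int = 1000):
    """Queue one command, ring the doorbell, then spin until it completes."""
    device.submission_queue.append(cmd_id)
    # Ring the doorbell: a plain store to mapped memory, no syscall.
    struct.pack_into("<I", regs, DOORBELL_OFFSET, len(device.submission_queue))
    for polls in range(1, max_polls + 1):
        device.tick()  # real hardware makes progress on its own
        if device.completion_queue:
            return device.completion_queue.pop(0), polls
    raise TimeoutError("device did not complete the command")


regs = mmap.mmap(-1, 4096)    # step 1: "registers" mapped into our address space
dma_buffer = bytearray(4096)  # step 2: buffer the device writes into directly
device = FakeDevice(regs, dma_buffer, latency_ticks=3)
cmd, polls = submit_and_poll(device, regs, cmd_id=7)  # step 3: poll, no interrupt
print(cmd, polls, dma_buffer[0])  # prints "7 3 7"
```

In a real userspace driver the mapping would come from the device's PCIe BAR and the buffer from pinned memory; the simulation only preserves the control flow.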

The trade-offs you make by going to userspace

Kernel bypass isn't free. Here's the bill:

You lose

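Kernel safety nets. The driver now lives in your process, so a bad pointer can take the whole storage stack down with it, and a wrong address handed to the device for DMA can corrupt memory silently instead of faulting (unless an IOMMU catches it). SPDK is full of assert() calls because of this.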

Filesystems, permissions, namespaces. All gone. You're talking to raw block devices.

Page cache. You manage your own data placement now.

Portability. Your code now depends on specific PCIe devices, NICs, hugepage setup, etc.

You gain

No syscalls in the hot path. Just a write to a memory location.

No copies. Your buffer IS the DMA target.

No interrupt latency. You poll when you want, on your own thread.

Predictable latency. No scheduling surprises.

What SPDK picks and why

SPDK's tagline used to be "putting the 'D' back in 'DAX'." The point: for storage workloads, the kernel is more obstacle than enabler. SPDK bets the trade is worth it for the users who care.

Specifically, SPDK commits to:

Poll-mode I/O (see 0.3 for why)
Hugepages for DMA memory (see 0.2 for the hardware context)
Pinning threads to cores (one reactor per core, see 2.1)
Cooperative scheduling, not preemptive (no kernel scheduler, your code runs to completion)
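A toy version of the last two commitments shows how they fit together. This is a sketch of the idea only, not SPDK's reactor API: `Reactor`, `register_poller`, and `pin_to_first_allowed_core` are names invented here. Each poller is a plain function that does a small amount of work and returns; nothing preempts it, so the loop only moves on when the poller gives control back.

```python
import os
from collections import deque
from typing import Callable, List


def pin_to_first_allowed_core() -> None:
    """Pin this process to one core where the OS supports it (Linux)."""
    if hasattr(os, "sched_setaffinity"):
        core = min(os.sched_getaffinity(0))
        os.sched_setaffinity(0, {core})


class Reactor:
    """Toy single-core event loop: pollers run in turn, never preempted."""

    def __init__(self) -> None:
        self.pollers: List[Callable[[], int]] = []
        self.running = True

    def register_poller(self, poller: Callable[[], int]) -> None:
        self.pollers.append(poller)

    def run(self, max_iterations: int) -> int:
        """Call every poller in order; return the total work items handled."""
        total = 0
        for _ in range(max_iterations):
            if not self.running:
                break
            for poller in self.pollers:
                total += poller()  # runs to completion before the next one
        return total


pin_to_first_allowed_core()
reactor = Reactor()
completions = deque(["io-1", "io-2", "io-3"])  # pretend device completions
handled = []


def completion_poller() -> int:
    """Handle at most one completion per call, then yield back to the loop."""
    if not completions:
        return 0
    handled.append(completions.popleft())
    return 1


def stop_when_idle() -> int:
    """Stop the reactor once there is nothing left to process."""
    if not completions:
        reactor.running = False
    return 0


reactor.register_poller(completion_poller)
reactor.register_poller(stop_when_idle)
total = reactor.run(max_iterations=10)
print(total, handled)  # prints "3 ['io-1', 'io-2', 'io-3']"
```

The cost of this design is visible in the code: a poller that blocks or loops for a long time stalls everything else on that core, which is why each one is kept short.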

What to take away

The reason SPDK exists is that the kernel's I/O path, however well-designed, imposes overheads that matter when you're pushing a million IOPS. The solution is to take the storage stack out of the kernel entirely and run it as a userspace process that drives the device directly, accepting the loss of kernel safety nets and portability in exchange for a hot path with no syscalls, copies, or interrupts.

The rest of the curriculum assumes you know this. The next primer page — NVMe at the hardware level — digs into what "talk to the device directly" actually means in NVMe terms.