Layer 0 · Prerequisite primer

NVMe at the hardware level.

The previous page explained why SPDK runs in userspace. This page explains what it talks to. An NVMe device isn't a "disk" — it's a PCIe endpoint with a memory-mapped register file, a small set of in-memory ring buffers, and a doorbell to get its attention. Once you see the hardware, the rest of SPDK starts to feel inevitable.

~12 min read · 1 diagram · prerequisite: 0.1 (Why userspace I/O exists)
On this page
  1. What "an NVMe device" actually is
  2. PCIe, BARs, and MMIO — how a CPU talks to a device
  3. The NVMe register file — what lives at the top of BAR0
  4. Submission queues and completion queues
  5. Doorbell registers: how a single write kicks the device
  6. An NVMe command, byte by byte
  7. Admin queue vs I/O queues
  8. MSI-X interrupts — and why SPDK turns them off
  9. DMA, PRPs, and SGLs: getting data into RAM
  10. Namespaces — one SSD, many "disks"
  11. Source walkthrough: how SPDK actually rings a doorbell
  12. Edge cases & what trips people up

What "an NVMe device" actually is

When you say "NVMe SSD" in casual conversation, you usually mean "a fast block-storage thing." When the kernel — or SPDK — talks to one, it means something much more specific:

  1. A PCIe endpoint. The device is a card on a PCIe bus with its own device ID, vendor ID, and BARs (Base Address Registers). It shows up under lspci like any other PCIe device.

  2. A memory-mapped register file. The device exposes a region of its internal registers to the CPU by claiming one or more BAR windows. The CPU accesses them with normal load/store instructions — no port I/O, no special instructions, no syscalls.

  3. A small DMA engine inside the device. The device can read from and write to system RAM directly, without the CPU's involvement, by performing PCIe DMA transactions.

  4. A protocol, defined by the NVM Express specification, that describes what those registers mean and what the DMA engine should do in response. The spec is the contract; the device and driver both implement it.

That's it. There is no "NVMe driver" inside the device in the way most people imagine — just a piece of silicon that reacts to register writes and to commands the host leaves in a ring buffer. The interesting work happens because both sides agree on the spec.

PCIe, BARs, and MMIO — how a CPU talks to a device

A PCIe device is enumerated at boot. The system firmware (or, with hotplug, the kernel) walks the PCIe tree, assigns each device a Bus:Device.Function address (the BDF), and reads the device's BAR registers. Each BAR describes a window of physical addresses that the device will respond to.

The OS then programs the IOMMU (if any) and the root complex. A userspace driver such as SPDK asks the kernel (through VFIO or UIO) to mmap() the BAR into its own virtual address space. From the process's point of view, the device is just a region of memory: writing to a specific address sets a register, reading from a specific address returns a status bit.

For an NVMe controller, the relevant BAR is BAR0. BAR0 is small, typically 16 KiB. The first 4 KiB (offsets 0x0000 to 0x0FFF) are set aside for the controller register file. The doorbell array starts at offset 0x1000 and grows toward the end of the address range: two doorbells per queue pair (one SQ tail, one CQ head). The spec allows up to 65,535 I/O queue pairs, although most devices implement far fewer.

The NVMe register file — what lives at the top of BAR0

The first few hundred bytes of BAR0 are the NVMe controller registers. The exact layout is mandated by the spec and mirrored in SPDK as struct spdk_nvme_registers in include/spdk/nvme_spec.h:534. The fields that matter for this primer:

| Register | Offset | What it does |
| --- | --- | --- |
| cap | 0x00 | Controller capabilities. Read-only. Tells the host the maximum queue depth (MQES), the doorbell stride (DSTRD), the ready timeout, and which command sets the device supports. |
| vs | 0x08 | Version of the NVMe spec this controller implements. |
| cc | 0x14 | Controller configuration. Host-writable. Enables the controller, sets the I/O queue entry size, etc. |
| csts | 0x1C | Controller status. Device-writable. Tells the host whether the controller is ready, processing, or in a fatal state. |
| aqa | 0x24 | Admin queue attributes. Size of the admin submission and completion queues. |
| asq / acq | 0x28 / 0x30 | Admin submission / completion queue physical base addresses. The host puts the ring buffers in RAM and tells the device where. |
| doorbell[] | 0x1000+ | One submission-queue tail doorbell + one completion-queue head doorbell per queue pair. The "kick the device" registers. |
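
To make the cap register less abstract, here is a minimal sketch that unpacks a few of its fields. The bit positions follow the NVMe base specification; the struct and function names are made up for this page and are not SPDK's, and the sample value in the usage note is hypothetical rather than read from a real device.

```c
#include <stdint.h>

/* A few CAP fields, decoded into friendlier units (names are illustrative). */
struct cap_fields {
    uint32_t max_queue_entries; /* MQES, bits 15:0, is 0's based, so add 1 */
    uint32_t contiguous_queues; /* CQR, bit 16: queues must be physically contiguous */
    uint32_t timeout_ms;        /* TO, bits 31:24, counts 500 ms units */
    uint32_t doorbell_stride;   /* bytes between doorbells: 4 << DSTRD (bits 35:32) */
    uint32_t min_page_size;     /* 2^(12 + MPSMIN) bytes, MPSMIN in bits 51:48 */
};

static struct cap_fields decode_cap(uint64_t cap)
{
    struct cap_fields f;

    f.max_queue_entries = (uint32_t)(cap & 0xFFFF) + 1;
    f.contiguous_queues = (uint32_t)((cap >> 16) & 0x1);
    f.timeout_ms        = (uint32_t)((cap >> 24) & 0xFF) * 500;
    f.doorbell_stride   = 4u << ((cap >> 32) & 0xF);
    f.min_page_size     = 1u << (12 + ((cap >> 48) & 0xF));
    return f;
}
```

With a hypothetical cap value of 0x000000001E0103FF, this reports 1024 entries per queue, a 15-second ready timeout, and doorbells packed 4 bytes apart.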

The two most important facts to internalize are:

  1. The admin queues are pointed to by registers (asq and acq). Before the host can submit any I/O, it allocates two ring buffers in RAM, writes their sizes into aqa and their physical addresses into asq and acq, writes cc.EN = 1 to bring the controller online, and waits for csts.RDY to read 1.

  2. After that, every interaction is via ring buffers and doorbells. The spec defines how the host fills a submission queue, rings the doorbell, and the device posts a completion.
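
The bring-up sequence can be sketched against a plain in-memory stand-in for BAR0. The struct below is a simplified illustration of the first 0x38 bytes of the register file, not SPDK's struct spdk_nvme_registers, and program_admin_queues is a hypothetical helper; the field encodings (0's based queue sizes in aqa, log2 entry sizes in cc) follow the spec.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified view of the start of BAR0 (illustrative, not SPDK's definition). */
struct regs_sketch {
    uint64_t cap;   /* 0x00 */
    uint32_t vs;    /* 0x08 */
    uint32_t intms; /* 0x0C */
    uint32_t intmc; /* 0x10 */
    uint32_t cc;    /* 0x14 */
    uint32_t rsvd;  /* 0x18 */
    uint32_t csts;  /* 0x1C */
    uint32_t nssr;  /* 0x20 */
    uint32_t aqa;   /* 0x24 */
    uint64_t asq;   /* 0x28 */
    uint64_t acq;   /* 0x30 */
};

static_assert(offsetof(struct regs_sketch, cc) == 0x14, "cc offset");
static_assert(offsetof(struct regs_sketch, csts) == 0x1C, "csts offset");
static_assert(offsetof(struct regs_sketch, aqa) == 0x24, "aqa offset");
static_assert(offsetof(struct regs_sketch, asq) == 0x28, "asq offset");
static_assert(offsetof(struct regs_sketch, acq) == 0x30, "acq offset");

/* Program the admin queue registers, then enable the controller. */
static void program_admin_queues(volatile struct regs_sketch *r,
                                 uint64_t asq_phys, uint64_t acq_phys,
                                 uint32_t entries)
{
    /* AQA: admin SQ size in bits 11:0, admin CQ size in bits 27:16, both 0's based. */
    r->aqa = (((entries - 1) & 0xFFF) << 16) | ((entries - 1) & 0xFFF);
    r->asq = asq_phys;
    r->acq = acq_phys;
    /* CC: EN is bit 0; IOSQES (19:16) = 6 for 64 B, IOCQES (23:20) = 4 for 16 B. */
    r->cc = (4u << 20) | (6u << 16) | 0x1;
    /* A real driver would now poll csts bit 0 (RDY) until it reads 1. */
}
```

The order matters: the queue addresses and sizes have to be in place before the enable bit is set, because the controller latches them as it comes online.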

Submission queues and completion queues

Once the admin queue is set up, the host can create more queues — for actual I/O — by submitting admin commands like Create I/O Submission Queue and Create I/O Completion Queue. Each created queue is identified by a 16-bit ID, and a pair of (SQ, CQ) is called a queue pair.

flowchart LR
  subgraph CPU["Host CPU"]
    H[SPDK process]
  end

  subgraph MEM["System RAM (DMA-capable)"]
    SQ["Submission Queue<br/>spdk_nvme_cmd, 64 B each<br/>rings: host writes, device reads"]
    CQ["Completion Queue<br/>spdk_nvme_cpl, 16 B each<br/>rings: device writes, host reads"]
    DATA["Data buffers (PRP / SGL)"]
  end

  subgraph PCIE["PCIe endpoint (NVMe device)"]
    REG["Controller registers<br/>BAR0: cap, cc, csts, asq, acq, ..."]
    DB["Doorbell array<br/>sq_tdbl, cq_hdbl per qpair"]
    CTRL["Controller silicon<br/>DMA engine + queues"]
  end

  H -- "mmap BAR0" --> REG
  H -- "write commands into SQ slot" --> SQ
  H -- "write tail to doorbell" --> DB
  DB -- "kicks controller" --> CTRL
  CTRL -- "DMA data in/out" --> DATA
  CTRL -- "posts cpl to CQ" --> CQ
  H -- "poll CQ for new completions" --> CQ

fig. 1: A PCIe NVMe controller, memory-mapped

fig. 1   The host and device share three pieces of memory: submission queue (host-writable, device-readable), completion queue (device-writable, host-readable), and data buffers (DMA in both directions). The doorbell is the only way the host tells the device "new commands are ready." A typical SQ is a circular array of 64-byte spdk_nvme_cmd entries; a typical CQ is a circular array of 16-byte spdk_nvme_cpl entries. Both must live in physically contiguous, DMA-addressable memory.

The shape of those ring entries is fixed by the spec. From include/spdk/nvme_spec.h:1452 :
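
The SPDK listing is not reproduced here. As a stand-in, this is a simplified sketch of the 64-byte layout the spec mandates for a submission entry. The struct name and the flattened field names are assumptions made for this page; SPDK's real spdk_nvme_cmd uses bitfields and unions for the same bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified 64-byte submission queue entry (illustrative layout). */
struct nvme_cmd_sketch {
    uint8_t  opc;       /* opcode, e.g. 0x02 for an NVM Read */
    uint8_t  fuse_psdt; /* bits 1:0 fused operation, bits 7:6 PSDT (PRP vs SGL) */
    uint16_t cid;       /* command identifier, echoed back in the completion */
    uint32_t nsid;      /* namespace this command targets */
    uint32_t cdw2;
    uint32_t cdw3;
    uint64_t mptr;      /* metadata pointer */
    uint64_t prp1;      /* data pointer, first half (PRP1 or start of an SGL) */
    uint64_t prp2;      /* data pointer, second half */
    uint32_t cdw10;     /* command-specific dwords 10..15 */
    uint32_t cdw11;
    uint32_t cdw12;
    uint32_t cdw13;
    uint32_t cdw14;
    uint32_t cdw15;
};

static_assert(sizeof(struct nvme_cmd_sketch) == 64, "SQ entry must be 64 bytes");
static_assert(offsetof(struct nvme_cmd_sketch, prp1) == 24, "dptr starts at byte 24");
static_assert(offsetof(struct nvme_cmd_sketch, cdw10) == 40, "cdw10 starts at byte 40");
```

The static assertions are the useful part: the device reads these bytes by offset, so a layout that compiles to anything other than 64 bytes is a bug, not a style choice.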

The completion entry is shorter — 16 bytes, fixed by the spec. From include/spdk/nvme_spec.h:1519 :
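
Again as a stand-in for the missing listing, here is a simplified sketch of the 16-byte completion entry together with helpers that pull apart its status word. The names are illustrative, not SPDK's; the bit positions (phase in bit 0, status code in bits 8:1, status code type in bits 11:9) follow the spec.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified 16-byte completion queue entry (illustrative layout). */
struct nvme_cpl_sketch {
    uint32_t cdw0;   /* command-specific result */
    uint32_t cdw1;
    uint16_t sqhd;   /* how far the device has consumed the SQ */
    uint16_t sqid;   /* which SQ the command came from */
    uint16_t cid;    /* identifier copied from the command */
    uint16_t status; /* phase tag plus status fields */
};

static_assert(sizeof(struct nvme_cpl_sketch) == 16, "CQ entry must be 16 bytes");

/* Phase tag: flips on every pass of the device around the ring. */
static unsigned cpl_phase(uint16_t status) { return status & 0x1; }

/* Status code: 0 means success within the given status code type. */
static unsigned cpl_sc(uint16_t status) { return (status >> 1) & 0xFF; }

/* Status code type: 0 = generic, 1 = command specific, and so on. */
static unsigned cpl_sct(uint16_t status) { return (status >> 9) & 0x7; }
```

A completion is a success when both cpl_sct and cpl_sc return 0; the phase bit says nothing about success, only about freshness.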

Doorbell registers: how a single write kicks the device

The doorbell is the cheapest possible interface between the host and the device. The host writes a single 32-bit value to a specific MMIO address, and the device knows "the host has just placed one or more new commands in submission queue N — go look at slot T."

The "specific MMIO address" is calculated from the queue ID and the device's dstrd (doorbell stride) capability. In SPDK, that math is in lib/nvme/nvme_pcie_common.c:229 :
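
The SPDK code is not shown here, but the arithmetic itself comes straight from the spec and fits in two functions. This is a sketch with hypothetical function names, not the SPDK implementation: doorbells start at offset 0x1000, each queue pair gets two slots (SQ tail first, CQ head second), and slots are spaced 4 << dstrd bytes apart.

```c
#include <assert.h>
#include <stdint.h>

#define DOORBELL_BASE 0x1000u

/* Byte offset in BAR0 of the submission queue tail doorbell for queue `qid`. */
static uint32_t sq_tail_doorbell_offset(uint16_t qid, uint32_t dstrd)
{
    assert(dstrd <= 15); /* DSTRD is a 4-bit field in CAP */
    return DOORBELL_BASE + (2u * qid) * (4u << dstrd);
}

/* Byte offset in BAR0 of the completion queue head doorbell for queue `qid`. */
static uint32_t cq_head_doorbell_offset(uint16_t qid, uint32_t dstrd)
{
    assert(dstrd <= 15);
    return DOORBELL_BASE + (2u * qid + 1u) * (4u << dstrd);
}
```

With dstrd = 0 the admin queue's doorbells land at 0x1000 and 0x1004, and the first I/O queue pair's at 0x1008 and 0x100C.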

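Doorbell writes are posted PCIe writes (writes with no completion), so the CPU does not wait for the device to acknowledge them; they typically reach the device within a few hundred nanoseconds on a modern PCIe link. The "ring the doorbell" idiom you'll see all over SPDK boils down to *sq_tdbl = tail; on a volatile pointer, preceded by a write barrier so that the command is visible in RAM before the device is told to fetch it.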

An NVMe command, byte by byte

A host submits work by writing one or more spdk_nvme_cmd entries into the submission queue, advancing the SQ tail, and ringing the doorbell. The device posts a completion by writing one spdk_nvme_cpl into the CQ, advancing its internal CQ tail, and (if interrupts are enabled) signaling an MSI-X vector. The host then consumes the entry and advances the CQ head.

  1. Build cmd: fill in opcode, nsid, LBA, PRP/SGL.
  2. Copy to SQ: into slot (sq_tail % num_entries).
  3. Advance tail: sq_tail = (sq_tail + 1) % num_entries.
  4. Ring doorbell: MMIO write of new tail to sq_tdbl.
  5. Device fetches: DMA-read of the cmd from RAM.
  6. Device executes: does the actual I/O (or admin op).
  7. Device posts cpl: DMA-write of spdk_nvme_cpl to CQ.
  8. (Optional) IRQ: MSI-X if interrupts enabled.
  9. Host polls CQ: or gets woken by IRQ.
  10. Host reaps cpl: advance cq_head, ring cq_hdbl.
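
The host side of steps 2 to 4 can be sketched as a tiny ring simulation. Everything here is illustrative: the struct, the 4-entry depth, and the plain variable standing in for the sq_tdbl MMIO register are assumptions for the sketch, not SPDK code, and there is no device on the other end.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SQ_ENTRIES 4u

struct cmd64 { uint8_t bytes[64]; };
static_assert(sizeof(struct cmd64) == 64, "SQ entries are 64 bytes");

struct sq_sketch {
    struct cmd64 ring[SQ_ENTRIES];
    uint32_t head;              /* last SQ head the device reported (via cpl.sqhd) */
    uint32_t tail;              /* next slot the host will fill */
    volatile uint32_t doorbell; /* stand-in for the MMIO sq_tdbl register */
};

/* Returns 0 on success, -1 if the ring is full. */
static int sq_submit(struct sq_sketch *sq, const struct cmd64 *cmd)
{
    uint32_t next = (sq->tail + 1) % SQ_ENTRIES;

    if (next == sq->head) {
        return -1; /* one slot always stays empty so full != empty */
    }
    memcpy(&sq->ring[sq->tail], cmd, sizeof(*cmd)); /* step 2: copy to SQ */
    sq->tail = next;                                /* step 3: advance tail */
    /* A real driver issues a write barrier here, before the MMIO write. */
    sq->doorbell = sq->tail;                        /* step 4: ring doorbell */
    return 0;
}
```

Note that a 4-entry ring holds at most 3 commands in flight. That is the price of telling "full" apart from "empty" using only the two indices.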

The doorbell write in step 4 is the only MMIO access on the submission path (step 10 adds one more on the completion side). Everything else is RAM and DMA. That's why an NVMe SSD can sustain a million IOPS with a modern CPU doing nothing but touching cache lines.

Admin queue vs I/O queues

Every NVMe controller has exactly one admin queue pair (QID 0, capacity up to 4096 entries per the spec). It is created automatically when the controller is enabled — the host tells the device where the admin SQ and CQ live in RAM by writing asq/acq and setting cc.EN = 1.

The admin queue handles housekeeping:

  • Identify — query the controller's identity, capabilities, namespace list.
  • Create I/O Completion Queue / Submission Queue — set up the data-plane queues.
  • Delete I/O CQ / SQ — tear them down.
  • Set Features / Get Features — toggle things like number of queues, interrupt coalescing, arbitration.
  • Format NVM, Firmware Commit, Sanitize — admin-level operations that affect the whole device.
  • Get Log Page, Async Event Request — telemetry and asynchronous event delivery.

I/O queues (QID 1..N) handle the actual data movement: read, write, flush, write-zeroes, dataset management, compare, vendor-specific commands. The host negotiates how many it may create with Set Features (Number of Queues), picks each queue's depth up to the cap.MQES limit (a 0's based value, so the maximum is MQES + 1 entries), and creates them via the admin queue. After that, the admin queue is off the data path: the host returns to it only for housekeeping such as log pages and asynchronous events, not for I/O.

MSI-X interrupts — and why SPDK turns them off

A modern NVMe device can signal the host that work is complete by firing an MSI-X interrupt — a message-signaled interrupt routed over PCIe to a specific CPU. The OS handles the interrupt by running a registered handler, which (in the kernel's case) is the NVMe driver's irq_handler.

SPDK's default mode is no interrupts. The user can turn them on (e.g. --enable-interrupts for spdk_nvme_perf), but the default is polling. Why?

  • Interrupt latency. On a busy system, the time from "device fires MSI-X" to "host code runs" is typically 5–50 µs. Polling, by contrast, can react in <1 µs.

  • Tail latency. Under load, the worst-case interrupt coalescing delay (the device waits a few hundred µs hoping to batch completions) makes p99.9 / p99.99 latencies bursty. Polling eliminates coalescing.

  • OS noise. An interrupt handler that runs in the kernel can be preempted, can be delayed by softirqs, can be migrated to another CPU. Polling is bound to one thread on one core, with nothing in between.

The next page in this primer (0.3) goes deep on this trade-off. For now, the rule is: SPDK has interrupts available (see lib/nvme/nvme_pcie.c:1049, spdk_pci_device_disable_interrupts, and lib/nvme/nvme_poll_group.c:87, spdk_nvme_poll_group_set_interrupt_callback), but the fast path doesn't use them.

DMA, PRPs, and SGLs: getting data into RAM

The device moves data with PCIe DMA — it can read from and write to system RAM directly, as long as the host has set up the IOMMU and given the device a bus address to use. There is no "send me a buffer" control command; the host just hands the device a list of physical addresses in the submission queue entry, and the device DMAs to/from them.

NVMe supports two ways to describe data buffers:

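| Mechanism | What it is | When you'd use it |
| --- | --- | --- |
| PRP (Physical Region Page) | One or two physical page addresses (one memory page each, typically 4 KiB), optionally chained via a list pointer. | Default for reads and writes over PCIe. PRP1 + PRP2 cover up to two pages directly; for larger I/O, PRP2 points to a PRP list. |
| SGL (Scatter-Gather List) | A flexible descriptor chain. A descriptor can be a data block, a bit bucket, a segment pointer, or a keyed data block (the key identifies registered memory on RDMA transports; it is not an encryption key). | NVMe over Fabrics, where SGLs are required, and PCIe devices that advertise SGL support, for buffers that are scattered or not page-aligned. |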

The psdt field in the SQ entry tells the device how to interpret dptr. The choice is per-command. PRP support is mandatory for PCIe controllers; SGL support is optional and advertised in the Identify Controller data. Most reads and writes over PCIe use PRP.
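
The PRP rules are easier to see as arithmetic. This sketch counts how many memory pages a buffer touches and derives which PRP form a command would need. It assumes a 4 KiB memory page size; the function and enum names are hypothetical, not SPDK's.

```c
#include <assert.h>
#include <stdint.h>

#define NVME_PAGE_SIZE 4096u /* assumed memory page size (CC.MPS = 0) */

enum prp_mode {
    PRP1_ONLY,     /* buffer fits in one page */
    PRP1_AND_PRP2, /* buffer touches two pages; PRP2 is the second page */
    PRP2_IS_LIST   /* three or more pages; PRP2 points to a PRP list */
};

/* Number of pages the buffer touches, which is the number of PRP entries. */
static uint32_t prp_entries_needed(uint64_t addr, uint32_t len)
{
    assert(len > 0);
    uint64_t first_page = addr / NVME_PAGE_SIZE;
    uint64_t last_page  = (addr + len - 1) / NVME_PAGE_SIZE;
    return (uint32_t)(last_page - first_page + 1);
}

static enum prp_mode prp_mode_for(uint64_t addr, uint32_t len)
{
    uint32_t n = prp_entries_needed(addr, len);

    if (n == 1) {
        return PRP1_ONLY;
    }
    if (n == 2) {
        return PRP1_AND_PRP2;
    }
    /* Only PRP1 may carry an offset into its page; list entries are page-aligned. */
    return PRP2_IS_LIST;
}
```

A 4 KiB read at a page-aligned address needs only PRP1, while the same 4 KiB read starting 512 bytes into a page straddles two pages and needs PRP2 as well. Alignment, not just length, decides the form.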

Namespaces — one SSD, many "disks"

An NVMe controller may expose one or more namespaces (NSID 1..N). A namespace is a range of logical blocks with a single LBA format. A single physical SSD can present:

  • One namespace covering the whole device (typical consumer drives).
  • Many namespaces, each a slice of the device, with independent LBA formats and block sizes (typical enterprise drives; ZNS SSDs).
  • No namespaces at all: a controller can come up with nothing attached. The host discovers what exists with Identify (the active namespace list), then issues Identify Namespace for each entry.

From the host's point of view, each namespace is independent: a read to NSID 5 doesn't interact with a read to NSID 7. The I/O command's nsid field selects which namespace the operation targets.

In SPDK, each namespace is a struct spdk_nvme_ns, and you call spdk_nvme_ns_cmd_read(ns, qpair, buf, lba, lba_count, cb, arg, io_flags). Underneath, that helper populates the SQ entry's nsid with the namespace's ID, puts the 64-bit starting LBA in cdw10/cdw11 and the zero-based LBA count in the low 16 bits of cdw12, and rings the doorbell.
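
In the NVM command set's Read command, the 64-bit starting LBA spans cdw10 (low half) and cdw11 (high half), and the block count sits in the low 16 bits of cdw12 as a 0's based value. A minimal sketch of that encoding, with hypothetical names that are not part of SPDK:

```c
#include <assert.h>
#include <stdint.h>

/* The three command dwords that describe an NVM Read's range. */
struct read_dwords {
    uint32_t cdw10; /* starting LBA, bits 31:0 */
    uint32_t cdw11; /* starting LBA, bits 63:32 */
    uint32_t cdw12; /* bits 15:0: number of logical blocks, 0's based */
};

static struct read_dwords encode_read(uint64_t slba, uint32_t lba_count)
{
    struct read_dwords d;

    /* A 16-bit 0's based field can express 1..65536 blocks. */
    assert(lba_count >= 1 && lba_count <= 65536);

    d.cdw10 = (uint32_t)(slba & 0xFFFFFFFFu);
    d.cdw11 = (uint32_t)(slba >> 32);
    d.cdw12 = (lba_count - 1) & 0xFFFFu;
    return d;
}
```

Reading 8 blocks therefore puts 7 in cdw12. The 0's based convention is a common source of off-by-one bugs when building commands by hand.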

Source walkthrough: how SPDK actually rings a doorbell

Here is the part of the data path that turns a spdk_nvme_ns_cmd_read call into an MMIO write. It is small enough to read in one sitting, and it captures all the ideas from this page:

Edge cases & what trips people up

1. What if you submit to a full queue?

The queue is "full" when (sq_tail + 1) % num_entries == sq_head. In SPDK, this is tracked by pqpair->qpair.queue_depth, and the public API guarantees you get -ENOMEM back if you try to submit when depth is at the cap. The device does not police this for you: overrunning the ring overwrites commands it has not fetched yet, and the result is at best an asynchronous error event and at worst undefined behavior. The host has to police this, and SPDK does, via trackers.

2. What if the completion comes before you ring the doorbell?

It can't. The device only posts a completion after it has fetched the command, which it only does after the doorbell. But completions can arrive in any order relative to other submissions on the same qpair, because the device may reorder them internally for performance. The host matches command to completion by the cid field, not by order.
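
Matching by cid is usually done with a small table indexed by the identifier. The sketch below is a toy version of that idea with made-up names and a fixed table of 8 slots; SPDK's real trackers carry much more state. The count_done callback exists only so the example has something to call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TRACKERS 8u

typedef void (*done_fn)(void *arg, uint16_t status);

struct tracker {
    done_fn cb;
    void   *arg;
    int     in_use;
};

static struct tracker g_trackers[MAX_TRACKERS];

/* Reserve a slot; its index becomes the command's cid. Returns -1 if none free. */
static int tracker_alloc(done_fn cb, void *arg)
{
    assert(cb != NULL);
    for (uint16_t cid = 0; cid < MAX_TRACKERS; cid++) {
        if (!g_trackers[cid].in_use) {
            g_trackers[cid] = (struct tracker){ cb, arg, 1 };
            return cid;
        }
    }
    return -1;
}

/* Look up the completion's cid, free the slot, then run the callback. */
static int tracker_complete(uint16_t cid, uint16_t status)
{
    if (cid >= MAX_TRACKERS || !g_trackers[cid].in_use) {
        return -1; /* unknown or duplicate completion */
    }
    struct tracker t = g_trackers[cid];
    g_trackers[cid].in_use = 0;
    t.cb(t.arg, status);
    return 0;
}

/* Example callback: counts completions through the user argument. */
static void count_done(void *arg, uint16_t status)
{
    (void)status;
    *(int *)arg += 1;
}
```

Because the lookup is by index, completing cid 1 before cid 0 works exactly like completing them in order, and a completion for a cid that is not in flight is rejected instead of being matched to the wrong request.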

3. The phase tag (the single most subtle thing in NVMe)

The CQ is a circular buffer. The host advances cq_head to consume completions. The device advances cq_tail to post new ones. When either pointer reaches the end of the ring, it wraps back to slot 0. Without a flag, the host can't tell "new completion" from "stale slot."

The p bit in the completion's status word is the flag. The host zeroes the CQ when it creates it, so every slot starts with p = 0. The device writes p = 1 on its first pass around the ring and inverts the value each time it wraps. The host keeps an expected phase, starting at 1 and flipped each time cq_head wraps. A slot whose p equals the expected phase is a fresh completion; a slot with the other value is stale, left over from the previous pass.
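
The phase-tag handshake is easiest to trust once you have watched it wrap. This sketch simulates both sides on a 4-entry ring: the device writes its current phase into each entry, and the host consumes entries only while the phase matches the value it expects. All names are illustrative, and the simulated device does not check for overrunning the host, which real hardware avoids by honoring the CQ head doorbell.

```c
#include <assert.h>
#include <stdint.h>

#define CQ_ENTRIES 4u

struct cpl16 {
    uint32_t cdw0, cdw1;
    uint16_t sqhd, sqid, cid;
    uint16_t status; /* bit 0 is the phase tag */
};
static_assert(sizeof(struct cpl16) == 16, "CQ entries are 16 bytes");

struct cq_sketch {
    struct cpl16 ring[CQ_ENTRIES]; /* zeroed at creation: every p starts at 0 */
    uint32_t head;                 /* host's consume index */
    uint16_t phase;                /* phase the host expects; starts at 1 */
};

struct dev_sketch {
    uint32_t tail;  /* device's produce index */
    uint16_t phase; /* phase the device writes; starts at 1 */
};

/* Device side: post one successful completion for `cid`. */
static void dev_post(struct cq_sketch *cq, struct dev_sketch *dev, uint16_t cid)
{
    struct cpl16 *e = &cq->ring[dev->tail];

    e->cid = cid;
    e->status = dev->phase; /* status code 0, phase tag in bit 0 */
    if (++dev->tail == CQ_ENTRIES) {
        dev->tail = 0;
        dev->phase ^= 1; /* invert on every wrap */
    }
}

/* Host side: reap up to `max` fresh completions, recording their cids. */
static int cq_poll(struct cq_sketch *cq, uint16_t *cids, int max)
{
    int n = 0;

    while (n < max) {
        struct cpl16 *e = &cq->ring[cq->head];

        if ((e->status & 1) != cq->phase) {
            break; /* stale slot: the device has not written here this pass */
        }
        cids[n++] = e->cid;
        if (++cq->head == CQ_ENTRIES) {
            cq->head = 0;
            cq->phase ^= 1; /* expect the opposite phase on the next pass */
        }
    }
    /* A real driver would now write cq->head to the cq_hdbl doorbell. */
    return n;
}
```

The host never needs the device's tail index. The phase comparison alone tells it where the fresh entries end, which is why polling a CQ costs nothing more than reading one cache line.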

4. Hot-remove while I/O is in flight

A PCIe NVMe device can be unplugged at any time. The next MMIO access to BAR0 — including a doorbell write — would normally raise a SIGBUS and kill the process. SPDK installs a SIGBUS handler that detects this case and remaps BAR0 to a placeholder region. After that, all I/O completes with an error (rather than crashing the process). The flow is documented in doc/nvme.md §Hotplug; the implementation lives in the PCIe transport, around the BAR-mapping code in lib/nvme/nvme_pcie.c:770 .

5. The admin queue is QID 0 — forever

You can delete I/O queues, but the admin queue is permanent. It is set up at cc.EN = 1 time and torn down at controller reset. If your code accidentally tries to delete QID 0, SPDK will refuse; if you bypass the API and tell the device to do it, you will lose the ability to do anything else, and the only recovery is a controller reset.

6. The NVMe spec is the spec, not the device

A given device may support only a subset of the spec, may interpret fields in a vendor-specific way, and may have quirks. SPDK maintains a quirks database in lib/nvme/nvme_quirks.c and applies it during controller construction. The spec tells you what can happen; the device tells you what does happen. Trust the spec for protocol, but always check device-specific behavior before assuming it works.

Why it matters

When you read later explainers (reactor, bdev, lvol, nvmf), you will see words like qpair, tracker, doorbell, PRP, namespace, phase tag. They are not abstractions on top of NVMe; they are NVMe. Every bdev I/O that ends up on a PCIe NVMe device eventually passes through nvme_pcie_qpair_submit_tracker (or a fabric-transport cousin of it). The whole bdev framework is a way to multiplex many logical "disks" onto a small set of physical NVMe queue pairs, one per CPU core, all submitting by ringing doorbells.

When you see a hung submission, you know what to check: the SQ tail, the doorbell, the CQ head, the phase tag. When you see a completion you don't expect, you know what to check: the cpl's cid, the sc/sct fields, the queue's queue_depth. The hardware is no longer a black box.