NVMe at the hardware level.
The previous page explained why SPDK runs in userspace. This page explains what it talks to. An NVMe device isn't a "disk" — it's a PCIe endpoint with a memory-mapped register file, a small set of in-memory ring buffers, and a doorbell to get its attention. Once you see the hardware, the rest of SPDK starts to feel inevitable.
- What "an NVMe device" actually is
- PCIe, BARs, and MMIO — how a CPU talks to a device
- The NVMe register file — what lives at the top of BAR0
- Submission queues and completion queues
- Doorbell registers: how a single write kicks the device
- An NVMe command, byte by byte
- Admin queue vs I/O queues
- MSI-X interrupts — and why SPDK turns them off
- DMA, PRPs, and SGLs: getting data into RAM
- Namespaces — one SSD, many "disks"
- Source walkthrough: how SPDK actually rings a doorbell
- Edge cases: what trips people up
What "an NVMe device" actually is
When you say "NVMe SSD" in casual conversation, you usually mean "a fast block-storage thing." When the kernel — or SPDK — talks to one, it means something much more specific:
A PCIe endpoint. The device is a card on a PCIe bus with its own device ID, vendor ID, and BARs (Base Address Registers). It shows up under lspci like any other PCIe device.
A memory-mapped register file. The device exposes a region of its internal registers to the CPU by claiming one or more BAR windows. The CPU accesses them with normal load/store instructions: no port I/O, no special instructions, no syscalls.
A small DMA engine inside the device. The device can read from and write to system RAM directly, without the CPU's involvement, by performing PCIe DMA transactions.
A protocol, defined by the NVM Express specification, that describes what those registers mean and what the DMA engine should do in response. The spec is the contract; the device and driver both implement it.
That's it. There is no "NVMe driver" inside the device in the way most people imagine — just a piece of silicon that reacts to register writes and to commands the host leaves in a ring buffer. The interesting work happens because both sides agree on the spec.
PCIe, BARs, and MMIO — how a CPU talks to a device
A PCIe device is enumerated at boot. The system firmware (or, with hotplug, the kernel) walks the PCIe tree, assigns each device a Bus:Device.Function address (the BDF), and reads the device's BAR registers. Each BAR describes a window of physical addresses that the device will respond to.
The OS then programs the IOMMU (if any) and the root complex, and
mmap()s those addresses into a process's virtual address
space. From the process's point of view, the device is just a region of
memory: writing to a specific address sets a register, reading from a
specific address returns a status bit.
For an NVMe controller, the relevant BAR is BAR0. BAR0 is small, commonly 16 KiB. The first 4 KiB (offsets 0x0000 through 0x0FFF) are reserved for the controller register file. The doorbell array starts at offset 0x1000 and grows toward the end of the address range: two doorbells per queue pair (an SQ tail and a CQ head), and the spec allows tens of thousands of queue pairs.
The NVMe register file — what lives at the top of BAR0
The first few hundred bytes of BAR0 are the NVMe controller registers.
The exact layout is mandated by the spec and mirrored in SPDK as
struct spdk_nvme_registers in
include/spdk/nvme_spec.h:534 . The fields that
matter for this primer:
| Register | Offset | What it does |
|---|---|---|
| cap | 0x00 | Controller capabilities. Read-only. Tells the host what queue depths, queue counts, and features the device supports. |
| vs | 0x08 | Version of the NVMe spec this controller implements. |
| cc | 0x14 | Controller configuration. Host-writable. Enables the controller, sets the I/O queue entry size, etc. |
| csts | 0x1C | Controller status. Device-writable. Tells the host whether the controller is ready, processing, or in a fatal state. |
| aqa | 0x24 | Admin queue attributes. Size of the admin submission and completion queues. |
| asq / acq | 0x28 / 0x30 | Admin submission / completion queue physical base addresses. The host puts the ring buffers in RAM and tells the device where. |
| doorbell[] | 0x1000+ | One submission-queue tail doorbell + one completion-queue head doorbell per queue pair. The "kick the device" registers. |
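To make the cap row concrete, here is a minimal sketch of pulling a few fields out of the 64-bit register value. The bit positions follow the NVMe base specification; the struct and function names (cap_fields, decode_cap) are made up for this example and are not SPDK's, which models the register as a bitfield union.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical holder for a few decoded CAP fields (not an SPDK type). */
struct cap_fields {
    uint32_t max_queue_entries; /* MQES is 0-based, so entries = MQES + 1 */
    uint32_t timeout_ms;        /* TO counts 500 ms units */
    uint32_t doorbell_stride;   /* bytes between doorbells = 4 << DSTRD */
    uint32_t min_page_size;     /* bytes = 1 << (12 + MPSMIN) */
};

static struct cap_fields decode_cap(uint64_t cap)
{
    struct cap_fields f;

    f.max_queue_entries = (uint32_t)(cap & 0xFFFF) + 1;          /* bits 15:0  */
    f.timeout_ms        = (uint32_t)((cap >> 24) & 0xFF) * 500;  /* bits 31:24 */
    f.doorbell_stride   = 4u << ((cap >> 32) & 0xF);             /* bits 35:32 */
    f.min_page_size     = 1u << (12 + ((cap >> 48) & 0xF));      /* bits 51:48 */
    return f;
}
```

On real hardware the input would come from a 64-bit MMIO read at offset 0x00 of BAR0; here it is just an integer, so the decoding can be checked without a device.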
The two most important facts to internalize are:
The admin queues are pointed to by registers (asq and acq). Before the host can submit any I/O, it allocates two ring buffers in RAM, writes their physical addresses into those registers, and then writes cc.EN = 1 to bring the controller online.
After that, every interaction is via ring buffers and doorbells. The spec defines how the host fills a submission queue, rings the doorbell, and the device posts a completion.
Submission queues and completion queues
Once the admin queue is set up, the host can create more queues — for
actual I/O — by submitting admin commands like
Create I/O Submission Queue and
Create I/O Completion Queue. Each created queue is
identified by a 16-bit ID, and a pair of (SQ, CQ) is called a
queue pair.
flowchart LR
  subgraph CPU["Host CPU"]
    H[SPDK process]
  end
  subgraph MEM["System RAM (DMA-capable)"]
    SQ["Submission Queue
spdk_nvme_cmd, 64 B each
rings: host writes, device reads"]
    CQ["Completion Queue
spdk_nvme_cpl, 16 B each
rings: device writes, host reads"]
    DATA["Data buffers (PRP / SGL)"]
  end
  subgraph PCIE["PCIe endpoint (NVMe device)"]
    REG["Controller registers
BAR0: cap, cc, csts, asq, acq, ..."]
    DB["Doorbell array
sq_tdbl, cq_hdbl per qpair"]
    CTRL["Controller silicon
DMA engine + queues"]
  end
  H -- "mmap BAR0" --> REG
  H -- "write commands into SQ slot" --> SQ
  H -- "write tail to doorbell" --> DB
  DB -- "kicks controller" --> CTRL
  CTRL -- "DMA data in/out" --> DATA
  CTRL -- "posts cpl to CQ" --> CQ
  H -- "poll CQ for new completions" --> CQ
fig. 1 The host and device share three pieces of memory:
submission queue (host-writable, device-readable),
completion queue (device-writable, host-readable), and
data buffers (DMA in both directions). The doorbell is
the only way the host tells the device "new commands are ready." A
typical SQ is a circular array of 64-byte spdk_nvme_cmd
entries; a typical CQ is a circular array of 16-byte
spdk_nvme_cpl entries. Both must live in physically
contiguous, DMA-addressable memory.
The shape of those ring entries is fixed by the spec. From include/spdk/nvme_spec.h:1452 :
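As a simplified stand-in for that layout (not the SPDK definition, which uses bitfields and unions), the 64-byte entry can be sketched like this. The type name sketch_nvme_cmd and the flat flags byte are assumptions for illustration; the offsets follow the spec's common command format, and a little-endian host is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative 64-byte submission queue entry (hypothetical type). */
struct sketch_nvme_cmd {
    uint8_t  opc;    /* dword 0, bits 7:0: opcode */
    uint8_t  flags;  /* dword 0, bits 15:8: fuse in bits 1:0, psdt in bits 7:6 */
    uint16_t cid;    /* dword 0, bits 31:16: command identifier */
    uint32_t nsid;   /* dword 1: namespace ID */
    uint32_t cdw2;   /* dwords 2-3: command specific / reserved */
    uint32_t cdw3;
    uint64_t mptr;   /* dwords 4-5: metadata pointer */
    uint64_t prp1;   /* dwords 6-7: first half of dptr */
    uint64_t prp2;   /* dwords 8-9: second half of dptr */
    uint32_t cdw10;  /* dwords 10-15: command specific */
    uint32_t cdw11;
    uint32_t cdw12;
    uint32_t cdw13;
    uint32_t cdw14;
    uint32_t cdw15;
};

/* The spec fixes the entry at 64 bytes; fail the build if padding sneaks in. */
static_assert(sizeof(struct sketch_nvme_cmd) == 64, "SQ entry must be 64 bytes");
static_assert(offsetof(struct sketch_nvme_cmd, prp1) == 24, "dptr starts at dword 6");
static_assert(offsetof(struct sketch_nvme_cmd, cdw10) == 40, "cdw10 starts at dword 10");
```

The compile-time size check is the useful habit here: the device reads these bytes by DMA, so a single byte of compiler padding would silently shift every field after it.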
The completion entry is shorter — 16 bytes, fixed by the spec. From include/spdk/nvme_spec.h:1519 :
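A comparable sketch of the 16-byte completion entry, again a simplified stand-in and not SPDK's struct. The type name sketch_nvme_cpl and the three accessor helpers are hypothetical; the bit positions inside the status word follow the spec, with a little-endian host assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative 16-byte completion queue entry (hypothetical type). */
struct sketch_nvme_cpl {
    uint32_t cdw0;   /* dword 0: command-specific result */
    uint32_t cdw1;   /* dword 1: reserved or command specific */
    uint16_t sqhd;   /* dword 2, low half: how far the device has consumed the SQ */
    uint16_t sqid;   /* dword 2, high half: which SQ the command came from */
    uint16_t cid;    /* dword 3, low half: echoes the command's cid */
    uint16_t status; /* dword 3, high half: p in bit 0, sc in 8:1, sct in 11:9 */
};

static_assert(sizeof(struct sketch_nvme_cpl) == 16, "CQ entry must be 16 bytes");

/* Phase tag: tells the host whether this slot is new. */
static unsigned cpl_phase(const struct sketch_nvme_cpl *c) { return c->status & 1u; }

/* Status code: 0 means success within the generic status code type. */
static unsigned cpl_sc(const struct sketch_nvme_cpl *c) { return (c->status >> 1) & 0xFFu; }

/* Status code type: 0 generic, 1 command specific, 2 media/data integrity. */
static unsigned cpl_sct(const struct sketch_nvme_cpl *c) { return (c->status >> 9) & 0x7u; }
```

A completion with status 0x0001, for example, decodes as phase 1, sct 0, sc 0: a successful command posted on a pass where the device is writing p = 1.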
Doorbell registers: how a single write kicks the device
The doorbell is the cheapest possible interface between the host and the device. The host writes a single 32-bit value to a specific MMIO address, and the device knows "the host has just placed one or more new commands in submission queue N — go look at slot T."
The "specific MMIO address" is calculated from the queue ID and the
device's dstrd (doorbell stride) capability. In SPDK, that
math is in
lib/nvme/nvme_pcie_common.c:229 :
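The arithmetic itself comes straight from the spec and is easy to sketch on its own. The function names below are hypothetical and this is not the SPDK code at that location; it only shows the offset formula, with doorbells starting at 0x1000 and spaced 4 << dstrd bytes apart.

```c
#include <assert.h>
#include <stdint.h>

/* Byte offset in BAR0 of the SQ tail doorbell for queue `qid`. */
static uint64_t sq_tail_doorbell_offset(uint16_t qid, uint8_t dstrd)
{
    assert(dstrd <= 15); /* DSTRD is a 4-bit field in CAP */
    return 0x1000u + (2ull * qid) * (4ull << dstrd);
}

/* Byte offset in BAR0 of the CQ head doorbell for queue `qid`. */
static uint64_t cq_head_doorbell_offset(uint16_t qid, uint8_t dstrd)
{
    assert(dstrd <= 15);
    return 0x1000u + (2ull * qid + 1) * (4ull << dstrd);
}
```

With dstrd = 0 the doorbells are packed 4 bytes apart, so the admin queue (QID 0) uses 0x1000 and 0x1004 and the first I/O queue pair uses 0x1008 and 0x100C. A larger stride spreads them out, which some controllers use to keep each doorbell on its own cache line or page.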
Doorbell writes are posted PCIe writes (writes with no
completion), which means they reach the device very quickly, within a few
hundred nanoseconds on a modern PCIe link. The "ring the doorbell" idiom
you'll see all over SPDK is literally *sq_tdbl = tail; on
a volatile pointer.
An NVMe command, byte by byte
A host submits work by writing one or more spdk_nvme_cmd
entries into the submission queue, advancing the SQ tail, and ringing
the doorbell. The device posts a completion by writing one
spdk_nvme_cpl into the CQ, advancing its internal CQ tail, and (if
interrupts are enabled) signaling an MSI-X vector. The host then
consumes the entry and advances the CQ head.
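The submit side can be sketched in a few lines. This is a toy model, not SPDK's implementation: the types sq_entry and sketch_sq and the function sketch_submit are hypothetical, the queue is fixed at 8 entries, the doorbell is any uint32_t the caller points at (an ordinary variable here, an MMIO address on real hardware), and there is no full-queue check.

```c
#include <assert.h>
#include <stdint.h>

#define SQ_ENTRIES 8u

/* Opaque 64-byte command slot; the contents don't matter for this sketch. */
struct sq_entry { uint8_t bytes[64]; };

struct sketch_sq {
    struct sq_entry ring[SQ_ENTRIES]; /* lives in DMA-able RAM on real hardware */
    uint32_t tail;                    /* host-owned producer index */
    volatile uint32_t *tail_doorbell; /* would point into BAR0 on real hardware */
};

static void sketch_submit(struct sketch_sq *sq, const struct sq_entry *cmd)
{
    assert(sq->tail < SQ_ENTRIES);

    sq->ring[sq->tail] = *cmd;              /* 1. copy the command into the slot */
    sq->tail = (sq->tail + 1) % SQ_ENTRIES; /* 2. advance the tail, wrapping */

    /* A real driver needs a write barrier here so the device never sees
     * the new tail before the command bytes are visible in RAM. */

    *sq->tail_doorbell = sq->tail;          /* 3. ring the doorbell */
}
```

Steps 1 and 2 only touch host memory. Step 3 is the single store that crosses the PCIe link.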
The "ring the doorbell" line is the entire hot path. Everything else is RAM and DMA. That's why an NVMe SSD can sustain a million IOPS with a modern CPU doing nothing but touching cache lines.
Admin queue vs I/O queues
Every NVMe controller has exactly one admin queue pair
(QID 0, capacity up to 4096 entries per the spec). It is created
automatically when the controller is enabled — the host tells the device
where the admin SQ and CQ live in RAM by writing
asq/acq and setting cc.EN = 1.
The admin queue handles housekeeping:
- Identify — query the controller's identity, capabilities, namespace list.
- Create I/O Completion Queue / Submission Queue — set up the data-plane queues.
- Delete I/O CQ / SQ — tear them down.
- Set Features / Get Features — toggle things like number of queues, interrupt coalescing, arbitration.
- Format NVM, Firmware Commit, Sanitize — admin-level operations that affect the whole device.
- Get Log Page, Async Event Request — telemetry and asynchronous event delivery.
I/O queues (QID 1..N) handle the actual data movement: read, write,
flush, write-zeroes, dataset management, compare, vendor-specific
commands. The host requests a number of queues with the Number of
Queues feature (Set Features), picks each queue's depth up to the
device's MQES capability, and creates the queues via the admin queue.
After that, the admin queue stays off the hot path: it carries only
housekeeping such as log pages and async events while I/O is in flight.
MSI-X interrupts — and why SPDK turns them off
A modern NVMe device can signal the host that work is complete by
firing an MSI-X interrupt — a message-signaled
interrupt routed over PCIe to a specific CPU. The OS handles the
interrupt by running a registered handler, which (in the kernel's case)
is the NVMe driver's irq_handler.
SPDK's default mode is no interrupts. The user can
turn them on (e.g. --enable-interrupts for
spdk_nvme_perf), but the default is polling. Why?
Interrupt latency. On a busy system, the time from "device fires MSI-X" to "host code runs" is typically 5–50 µs. Polling, by contrast, can react in <1 µs.
Tail latency. Under load, the worst-case interrupt coalescing delay (the device waits a few hundred µs hoping to batch completions) makes p99.9 / p99.99 latencies bursty. Polling eliminates coalescing.
OS noise. An interrupt handler that runs in the kernel can be preempted, can be delayed by softirqs, can be migrated to another CPU. Polling is bound to one thread on one core, with nothing in between.
The next page in this primer (0.3) goes deep on this trade-off. For now, the rule is: SPDK has interrupts available, see lib/nvme/nvme_pcie.c:1049 (spdk_pci_device_disable_interrupts) and spdk_nvme_poll_group_set_interrupt_callback, but the fast path doesn't use them.
DMA, PRPs, and SGLs: getting data into RAM
The device moves data with PCIe DMA — it can read from and write to system RAM directly, as long as the host has set up the IOMMU and given the device a bus address to use. There is no "send me a buffer" control command; the host just hands the device a list of physical addresses in the submission queue entry, and the device DMAs to/from them.
NVMe supports two ways to describe data buffers:
| Mechanism | What it is | When you'd use it |
|---|---|---|
| PRP (Physical Region Page) | One or two physical page addresses (4 KiB each, optionally chained via a list pointer). | Default for reads and writes. PRP1 + PRP2 cover up to two pages directly; for larger I/O, PRP2 points to a PRP list. |
| SGL (Scatter-Gather List) | A flexible descriptor chain. A descriptor can be a data block, a bit bucket, a segment pointer, or a keyed data block (the key is a remote-memory key, not an encryption key). | Buffers that are scattered or not page-aligned, and NVMe over Fabrics transports, which require SGLs. |
The psdt field in the SQ entry tells the device whether to
interpret dptr as a PRP pair or as an SGL. The choice is per-command.
On PCIe, PRP support is mandatory and SGL support is optional
(advertised in the Identify Controller data), so most reads and
writes use PRP.
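To see when PRP1 alone is enough and when a PRP list is needed, it helps to count the memory pages a buffer touches. The sketch below is illustrative only: the names pages_touched, prp_layout, and prp_layout_for are hypothetical, the page size is assumed to be a power of two (the controller's configured memory page size, usually 4 KiB), and the length is assumed to be nonzero.

```c
#include <assert.h>
#include <stdint.h>

enum prp_layout {
    PRP1_ONLY,      /* transfer fits in one page */
    PRP1_PRP2,      /* two pages, both addressed directly */
    PRP1_PLUS_LIST  /* PRP2 points at a list of the remaining pages */
};

/* Number of memory pages covered by [addr, addr + len). */
static uint64_t pages_touched(uint64_t addr, uint64_t len, uint64_t page_size)
{
    assert(len > 0);
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

    uint64_t offset = addr & (page_size - 1); /* only PRP1 may have an offset */
    return (offset + len + page_size - 1) / page_size;
}

static enum prp_layout prp_layout_for(uint64_t addr, uint64_t len, uint64_t page_size)
{
    uint64_t pages = pages_touched(addr, len, page_size);

    if (pages == 1) return PRP1_ONLY;
    if (pages == 2) return PRP1_PRP2;
    return PRP1_PLUS_LIST;
}
```

Note that a 4 KiB transfer is not always a single page: if the buffer starts 0x200 bytes into a page it spills into the next one, so PRP2 is needed as well. A 128 KiB page-aligned transfer touches 32 pages, so PRP2 must point at a list holding the other 31 addresses.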
Namespaces — one SSD, many "disks"
An NVMe controller may expose one or more namespaces (NSID 1..N). A namespace is a range of logical blocks with a single LBA format. A single physical SSD can present:
- One namespace covering the whole device (typical consumer drives).
- Many namespaces, each a slice of the device, with independent LBA formats and block sizes (typical enterprise drives; ZNS SSDs).
- No namespaces at all during early boot. The host has to issue Identify Namespace to discover them.
From the host's point of view, each namespace is independent: a
read to NSID 5 doesn't interact with a read to
NSID 7. The I/O command's nsid field selects which
namespace the operation targets.
In SPDK, each namespace is a struct spdk_nvme_ns, and you
call spdk_nvme_ns_cmd_read(ns, qpair, buf, lba, lba_count, cb, arg, io_flags).
Underneath, that helper populates the SQ entry's nsid with
the namespace's ID, sets cdw10/cdw11 to the low and high halves of the
starting LBA, sets cdw12 to the LBA count minus one, and rings the doorbell.
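For reference, the NVM command set's Read places the 64-bit starting LBA across cdw10 (low half) and cdw11 (high half), and the block count, stored zero-based, in the low 16 bits of cdw12. A minimal sketch of that encoding follows. The names read_dwords and encode_read are hypothetical, and the flag bits in the upper half of cdw12 are left at zero.

```c
#include <assert.h>
#include <stdint.h>

/* The three command dwords a Read needs for its LBA range (hypothetical type). */
struct read_dwords {
    uint32_t cdw10; /* starting LBA, bits 31:0 */
    uint32_t cdw11; /* starting LBA, bits 63:32 */
    uint32_t cdw12; /* bits 15:0: number of blocks minus one */
};

static struct read_dwords encode_read(uint64_t slba, uint32_t lba_count)
{
    struct read_dwords r;

    /* The 16-bit zero-based field can express 1 to 65536 blocks. */
    assert(lba_count >= 1 && lba_count <= 65536);

    r.cdw10 = (uint32_t)(slba & 0xFFFFFFFFu);
    r.cdw11 = (uint32_t)(slba >> 32);
    r.cdw12 = (lba_count - 1) & 0xFFFFu;
    return r;
}
```

The zero-based count is an easy off-by-one: a request for 8 blocks goes on the wire as 7.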
Source walkthrough: how SPDK actually rings a doorbell
Here is the part of the data path that turns a
spdk_nvme_ns_cmd_read call into an MMIO write. It is small
enough to read in one sitting, and it captures all the ideas from this
page:
Edge cases & what trips people up
1. What if you submit to a full queue?
The queue is "full" when (sq_tail + 1) % num_entries == sq_head.
In SPDK, this is tracked by pqpair->qpair.queue_depth, and
the public API guarantees you get -ENOMEM back if you try
to submit when depth is at the cap. The device cannot reject an
over-full submit: the host would overwrite a command the device has
not fetched yet, and the resulting behavior is undefined (the
controller may report the bad doorbell value through an async event).
The host has to police this, and SPDK does, via trackers.
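The full test and the matching free-slot count are plain modular arithmetic. A small sketch, with hypothetical function names that are not taken from SPDK:

```c
#include <assert.h>
#include <stdint.h>

/* Full when advancing the tail would make it collide with the head. */
static int sq_is_full(uint32_t head, uint32_t tail, uint32_t num_entries)
{
    assert(num_entries >= 2 && head < num_entries && tail < num_entries);
    return (tail + 1) % num_entries == head;
}

/* Slots the host may still fill. One slot always stays empty, so an
 * empty queue reports num_entries - 1. */
static uint32_t sq_free_slots(uint32_t head, uint32_t tail, uint32_t num_entries)
{
    assert(num_entries >= 2 && head < num_entries && tail < num_entries);
    return (head + num_entries - tail - 1) % num_entries;
}
```

The one-slot gap is what lets head == tail mean "empty" without ambiguity, so a queue created with 8 entries holds at most 7 outstanding commands.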
2. What if the completion comes before you ring the doorbell?
It can't. The device only posts a completion after it has
fetched the command, which it only does after the doorbell. But
completions can arrive in any order relative to other
submissions on the same qpair, because the device may reorder them
internally for performance. The host matches command to completion by
the cid field, not by order.
3. The phase tag (the single most subtle thing in NVMe)
The CQ is a circular buffer. The host advances cq_head to
consume completions. The device advances cq_tail to post
new ones. When either pointer reaches the end of the queue, it wraps
back to slot 0. Without a flag, the host can't tell "new completion"
from "stale slot."
The p bit in the completion's status word is
the flag. The host zeroes the queue before use. The device writes
p = 1 on its first pass around the queue and inverts p on every
later pass. The host keeps an expected phase that starts at 1 and
flips each time cq_head wraps. If the cpl at cq_head carries the
expected p, the slot is fresh. If it carries the opposite value, it
is a stale entry left over from the previous pass.
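The usual host-side rule is: keep an expected phase that starts at 1, treat a slot as new only if its p bit equals that value, and flip the expected phase each time the head wraps. Here is a toy model of both sides, assuming a 4-entry queue and modelling only the status word. All names (sketch_cq, sketch_dev, cq_init, cq_poll_one, dev_post) are hypothetical; SPDK's polling loop does considerably more.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CQ_ENTRIES 4u

struct sketch_cq {
    uint16_t status[CQ_ENTRIES]; /* only the status word; bit 0 is p */
    uint32_t head;               /* host-owned consumer index */
    uint8_t  phase;              /* p value the host expects next */
};

/* Toy device state: where it posts next and which p it is writing. */
struct sketch_dev { uint32_t tail; uint8_t phase; };

static void cq_init(struct sketch_cq *cq)
{
    memset(cq, 0, sizeof(*cq)); /* zeroed slots all read as p = 0 */
    cq->phase = 1;              /* the device's first pass writes p = 1 */
}

/* Device side: post one successful completion with the current phase. */
static void dev_post(struct sketch_cq *cq, struct sketch_dev *dev)
{
    cq->status[dev->tail] = dev->phase;
    if (++dev->tail == CQ_ENTRIES) { dev->tail = 0; dev->phase ^= 1u; }
}

/* Host side: returns 1 and consumes the entry at head if it is new. */
static int cq_poll_one(struct sketch_cq *cq)
{
    if ((cq->status[cq->head] & 1u) != cq->phase)
        return 0; /* stale: written on an earlier pass, or never */
    if (++cq->head == CQ_ENTRIES) { cq->head = 0; cq->phase ^= 1u; }
    return 1;
}
```

After the device fills all four slots with p = 1 and the host drains them, the host's expected phase is 0, so slot 0 (still holding p = 1) correctly reads as stale until the device overwrites it on its second pass.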
4. Hot-remove while I/O is in flight
A PCIe NVMe device can be unplugged at any time. The next MMIO access
to BAR0 — including a doorbell write — would normally raise a
SIGBUS and kill the process. SPDK installs a
SIGBUS handler that detects this case and remaps BAR0 to a
placeholder region. After that, all I/O completes with an error
(rather than crashing the process). The flow is documented in
doc/nvme.md §Hotplug; the implementation lives in
the PCIe transport, around the BAR-mapping code in
lib/nvme/nvme_pcie.c:770 .
5. The admin queue is QID 0 — forever
You can delete I/O queues, but the admin queue is permanent. It is set
up at cc.EN = 1 time and torn down at controller reset.
If your code accidentally tries to delete QID 0, SPDK will refuse; if
you bypass the API and tell the device to do it, you will lose the
ability to do anything else, and the only recovery is a controller
reset.
6. The NVMe spec is the spec, not the device
A given device may support only a subset of the spec, may interpret
fields in a vendor-specific way, and may have quirks. SPDK
maintains a quirks database in lib/nvme/nvme_quirks.c and
applies it during controller construction. The spec tells you what
can happen; the device tells you what does happen.
Trust the spec for protocol, but always check device-specific behavior
before assuming it works.
Why it matters
When you read later explainers — reactor, bdev, lvol, nvmf — you will
see words like qpair, tracker, doorbell,
PRP, namespace, phase tag. They are not
abstractions on top of NVMe; they are NVMe. Every bdev module that
issues I/O eventually calls into nvme_pcie_qpair_submit_tracker
(or a fabric-transport cousin of it). The whole bdev framework is a way
to multiplex many logical "disks" onto a small set of physical NVMe
queue pairs, one per CPU core, all submitting by ringing doorbells.
When you see a hung submission, you know what to check: the SQ tail,
the doorbell, the device's CQ head, the phase tag. When you see a
completion you don't expect, you know what to check: the cpl's
cid, the sc/sct fields, the
queue's queue_depth. The hardware is no longer a black box.