Layer 0 · Prerequisite primer

NVMe at the hardware level.

The previous page explained why SPDK runs in userspace. This page explains what it talks to. An NVMe device isn't a "disk" — it's a PCIe endpoint with a memory-mapped register file, a small set of in-memory ring buffers, and a doorbell to get its attention. Once you see the hardware, the rest of SPDK starts to feel inevitable.

~12 min read · 1 diagram · prerequisite: 0.1 (Why userspace I/O exists)

On this page

What "an NVMe device" actually is
PCIe, BARs, and MMIO — how a CPU talks to a device
The NVMe register file — what lives at the top of BAR0
Submission queues and completion queues
Doorbell registers: how a single write kicks the device
An NVMe command, byte by byte
Admin queue vs I/O queues
MSI-X interrupts — and why SPDK turns them off
DMA, PRPs, and SGLs: getting data into RAM
Namespaces — one SSD, many "disks"
Source walkthrough: how SPDK actually rings a doorbell
Edge cases & what trips people up

What "an NVMe device" actually is

When you say "NVMe SSD" in casual conversation, you usually mean "a fast block-storage thing." When the kernel — or SPDK — talks to one, it means something much more specific:

A PCIe endpoint. The device is a card on a PCIe bus with its own device ID, vendor ID, and BARs (Base Address Registers). It shows up under lspci like any other PCIe device.
A memory-mapped register file. The device exposes a region of its internal registers to the CPU by claiming one or more BAR windows. The CPU accesses them with normal load/store instructions — no port I/O, no special instructions, no syscalls.
A small DMA engine inside the device. The device can read from and write to system RAM directly, without the CPU's involvement, by performing PCIe DMA transactions.
A protocol, defined by the NVM Express specification, that describes what those registers mean and what the DMA engine should do in response. The spec is the contract; the device and driver both implement it.

That's it. There is no "NVMe driver" inside the device in the way most people imagine — just a piece of silicon that reacts to register writes and to commands the host leaves in a ring buffer. The interesting work happens because both sides agree on the spec.

PCIe, BARs, and MMIO — how a CPU talks to a device

A PCIe device is enumerated at boot. The system firmware (or, with hotplug, the kernel) walks the PCIe tree, assigns each device a Bus:Device.Function address (the BDF), and reads the device's BAR registers. Each BAR describes a window of physical addresses that the device will respond to.

The OS then programs the IOMMU (if any) and the root complex. A userspace driver, going through a kernel helper such as VFIO or UIO, mmap()s those addresses into its own virtual address space. From the process's point of view, the device is just a region of memory: writing to a specific address sets a register, reading from a specific address returns a status bit.

For an NVMe controller, the relevant BAR is BAR0. BAR0 is small, typically 16 KiB. The controller register file occupies the first 4 KiB, most of it reserved. The doorbell array starts at offset 0x1000 and grows toward the end of the address range: two doorbells per queue pair (one SQ tail, one CQ head), and the spec allows tens of thousands of queue pairs.
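
To make "the device is just a region of memory" concrete, here is a minimal sketch of 32-bit register accessors. The helper names mmio_read32 and mmio_write32 are hypothetical (SPDK has its own MMIO helpers), and bar stands in for whatever pointer mmap() returned. The volatile qualifier is what stops the compiler from caching, merging, or dropping the accesses.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers, not SPDK API. `bar` is the start of a mapped BAR;
 * `off` is a byte offset into it and must be 4-byte aligned. */
static inline uint32_t mmio_read32(volatile void *bar, uint32_t off)
{
    assert((off & 3u) == 0);
    return *(volatile uint32_t *)((volatile uint8_t *)bar + off);
}

static inline void mmio_write32(volatile void *bar, uint32_t off, uint32_t val)
{
    assert((off & 3u) == 0);
    *(volatile uint32_t *)((volatile uint8_t *)bar + off) = val;
}
```

With a real mapping, mmio_read32(bar, 0x1C) would return the controller status register and mmio_write32(bar, 0x1000, tail) would ring the admin SQ doorbell. The same functions work on an ordinary array, which is handy for unit tests.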

The NVMe register file — what lives at the top of BAR0

The first few hundred bytes of BAR0 are the NVMe controller registers. The exact layout is mandated by the spec and mirrored in SPDK as struct spdk_nvme_registers in include/spdk/nvme_spec.h:534 . The fields that matter for this primer:

Register	Offset	What it does
`cap`	0x00	Controller capabilities. Read-only. Tells the host the maximum queue depth (MQES), the doorbell stride (DSTRD), the ready timeout, and which command sets the device supports.
`vs`	0x08	Version of the NVMe spec this controller implements.
`cc`	0x14	Controller configuration. Host-writable. Enables the controller, sets the I/O queue entry size, etc.
`csts`	0x1C	Controller status. Device-writable. Tells the host whether the controller is ready, processing, or in a fatal state.
`aqa`	0x24	Admin queue attributes. Size of the admin submission and completion queues.
`asq` / `acq`	0x28 / 0x30	Admin submission / completion queue physical base addresses. The host puts the ring buffers in RAM and tells the device where.
`doorbell[]`	0x1000+	One submission-queue tail doorbell + one completion-queue head doorbell per queue pair. The "kick the device" registers.
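
The offsets in the table can be checked mechanically. The struct below is an illustrative mirror of the first 0x38 bytes of BAR0, not SPDK's struct spdk_nvme_registers; the fields that the table skips (intms, intmc, nssr) are filled in from the NVMe specification so that the offsets line up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative layout only; field names follow the table above. */
struct nvme_regs_sketch {
    uint64_t cap;    /* 0x00 controller capabilities */
    uint32_t vs;     /* 0x08 version */
    uint32_t intms;  /* 0x0C interrupt mask set */
    uint32_t intmc;  /* 0x10 interrupt mask clear */
    uint32_t cc;     /* 0x14 controller configuration */
    uint32_t rsvd;   /* 0x18 reserved */
    uint32_t csts;   /* 0x1C controller status */
    uint32_t nssr;   /* 0x20 NVM subsystem reset */
    uint32_t aqa;    /* 0x24 admin queue attributes */
    uint64_t asq;    /* 0x28 admin SQ base address */
    uint64_t acq;    /* 0x30 admin CQ base address */
};

/* Compile-time proof that the struct matches the documented offsets. */
static_assert(offsetof(struct nvme_regs_sketch, vs)   == 0x08, "vs offset");
static_assert(offsetof(struct nvme_regs_sketch, cc)   == 0x14, "cc offset");
static_assert(offsetof(struct nvme_regs_sketch, csts) == 0x1C, "csts offset");
static_assert(offsetof(struct nvme_regs_sketch, aqa)  == 0x24, "aqa offset");
static_assert(offsetof(struct nvme_regs_sketch, asq)  == 0x28, "asq offset");
static_assert(offsetof(struct nvme_regs_sketch, acq)  == 0x30, "acq offset");
```

Because every check is a static_assert, a wrong field order or size fails the build instead of silently poking the wrong register.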

The two most important facts to internalize are:

The admin queues are pointed to by registers (asq and acq). Before the host can submit any I/O, it allocates two ring buffers in RAM, writes their sizes into aqa and their physical addresses into asq and acq, writes cc.EN = 1 to bring the controller online, and waits for csts.RDY to read back as 1.
After that, every interaction is via ring buffers and doorbells. The spec defines how the host fills a submission queue, rings the doorbell, and the device posts a completion.
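
The bring-up order can be sketched as a single function. Everything here is an assumption for illustration: the function name, the helper names, and the bounded poll loop are not SPDK's, and a real driver also programs the queue entry sizes, page size, and command set in cc and uses the timeout from cap instead of a poll count.

```c
#include <assert.h>
#include <stdint.h>

#define NVME_REG_CC    0x14u
#define NVME_REG_CSTS  0x1Cu
#define NVME_REG_AQA   0x24u
#define NVME_REG_ASQ   0x28u
#define NVME_REG_ACQ   0x30u

static uint32_t reg_read32(volatile uint8_t *bar, uint32_t off)
{
    return *(volatile uint32_t *)(bar + off);
}

static void reg_write32(volatile uint8_t *bar, uint32_t off, uint32_t v)
{
    *(volatile uint32_t *)(bar + off) = v;
}

/* Single 64-bit store for brevity; some platforms need two 32-bit writes. */
static void reg_write64(volatile uint8_t *bar, uint32_t off, uint64_t v)
{
    *(volatile uint64_t *)(bar + off) = v;
}

/* Returns 0 once CSTS.RDY (bit 0) is set, -1 if it never is within max_polls reads. */
static int nvme_enable_sketch(volatile uint8_t *bar, uint64_t asq_phys,
                              uint64_t acq_phys, uint16_t entries, unsigned max_polls)
{
    assert(entries >= 2 && entries <= 4096);

    /* AQA: ACQS in bits 27:16, ASQS in bits 11:0, both stored as (size - 1). */
    uint32_t aqa = ((uint32_t)(entries - 1) << 16) | (uint32_t)(entries - 1);
    reg_write32(bar, NVME_REG_AQA, aqa);
    reg_write64(bar, NVME_REG_ASQ, asq_phys);
    reg_write64(bar, NVME_REG_ACQ, acq_phys);

    /* CC.EN is bit 0. */
    reg_write32(bar, NVME_REG_CC, reg_read32(bar, NVME_REG_CC) | 1u);

    while (max_polls--) {
        if (reg_read32(bar, NVME_REG_CSTS) & 1u) {
            return 0;
        }
    }
    return -1;
}
```

The order matters: the queue addresses and sizes have to be in place before the enable bit is set, because the controller latches them when it comes up.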

Submission queues and completion queues

Once the admin queue is set up, the host can create more queues — for actual I/O — by submitting admin commands like Create I/O Submission Queue and Create I/O Completion Queue. Each created queue is identified by a 16-bit ID, and a pair of (SQ, CQ) is called a queue pair.

flowchart LR
subgraph CPU["Host CPU"]
  H[SPDK process]
end

subgraph MEM["System RAM (DMA-capable)"]
  SQ["Submission Queue
spdk_nvme_cmd, 64 B each
rings: host writes, device reads"]
  CQ["Completion Queue
spdk_nvme_cpl, 16 B each
rings: device writes, host reads"]
  DATA["Data buffers (PRP / SGL)"]
end

subgraph PCIE["PCIe endpoint (NVMe device)"]
  REG["Controller registers
BAR0: cap, cc, csts, asq, acq, ..."]
  DB["Doorbell array
sq_tdbl, cq_hdbl per qpair"]
  CTRL["Controller silicon
DMA engine + queues"]
end

H -- "mmap BAR0" --> REG
H -- "write commands into SQ slot" --> SQ
H -- "write tail to doorbell" --> DB
DB -- "kicks controller" --> CTRL
CTRL -- "DMA data in/out" --> DATA
CTRL -- "posts cpl to CQ" --> CQ
H -- "poll CQ for new completions" --> CQ

fig. 1: A PCIe NVMe controller, memory-mapped

fig. 1 The host and device share three pieces of memory: submission queue (host-writable, device-readable), completion queue (device-writable, host-readable), and data buffers (DMA in both directions). The doorbell is the only way the host tells the device "new commands are ready." A typical SQ is a circular array of 64-byte spdk_nvme_cmd entries; a typical CQ is a circular array of 16-byte spdk_nvme_cpl entries. Both must live in physically contiguous, DMA-addressable memory.

The shape of those ring entries is fixed by the spec. From include/spdk/nvme_spec.h:1452 :

spdk_v26_01_migration/include/spdk/nvme_spec.h · lines 1452-1504 struct spdk_nvme_cmd — 64 bytes, exactly the size the spec mandates

A submission queue entry is exactly 64 bytes. Dword 0 holds the opcode, the flag bits (fuse, psdt), and the command identifier; dword 1 holds the namespace ID. The middle carries the metadata pointer and the data pointer (PRP1 and PRP2, or an SGL descriptor). The last six dwords (cdw10 to cdw15) are command-specific parameters such as the starting LBA and the block count.

struct spdk_nvme_cmd {
    /* dword 0 */
    uint16_t opc    :  8;  /* opcode: read, write, flush, identify, ... */
    uint16_t fuse   :  2;  /* fused operation (compare+write) */
    uint16_t rsvd1  :  4;
    uint16_t psdt   :  2;  /* PRP or SGL */
    uint16_t cid;          /* command identifier — echoed in cpl */

    /* dword 1 */
    uint32_t nsid;         /* namespace ID (1..N for I/O) */

    /* dword 2-3 */
    uint32_t rsvd2;
    uint32_t rsvd3;

    /* dword 4-5 */
    uint64_t mptr;         /* metadata pointer */

    /* dword 6-9: data pointer */
    union {
        struct { uint64_t prp1; uint64_t prp2; } prp;
        struct spdk_nvme_sgl_descriptor sgl1;
    } dptr;

    /* dword 10-15: command-specific */
    uint32_t cdw10, cdw11, cdw12, cdw13, cdw14, cdw15;
};
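
The 64-byte claim is easy to verify with a flat mirror of the entry. This is a hypothetical layout written with plain fixed-width fields instead of bit-fields; it is not SPDK's definition, and it folds the data pointer union down to its PRP form.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flat mirror of a submission queue entry. */
struct sqe_sketch {
    uint8_t  opc;          /* dword 0, bits 7:0 */
    uint8_t  flags;        /* dword 0, bits 15:8: fuse in bits 1:0, psdt in bits 7:6 */
    uint16_t cid;          /* dword 0, bits 31:16 */
    uint32_t nsid;         /* dword 1 */
    uint32_t cdw2, cdw3;   /* dword 2-3 */
    uint64_t mptr;         /* dword 4-5 */
    uint64_t prp1, prp2;   /* dword 6-9 */
    uint32_t cdw10, cdw11, cdw12, cdw13, cdw14, cdw15; /* dword 10-15 */
};

static_assert(sizeof(struct sqe_sketch) == 64, "SQ entry must be 64 bytes");
static_assert(offsetof(struct sqe_sketch, mptr)  == 16, "mptr is dword 4");
static_assert(offsetof(struct sqe_sketch, prp1)  == 24, "dptr is dword 6");
static_assert(offsetof(struct sqe_sketch, cdw10) == 40, "cdw10 is dword 10");
```

The dword number times four gives the byte offset, which is why dword 10 lands at byte 40.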

The completion entry is shorter — 16 bytes, fixed by the spec. From include/spdk/nvme_spec.h:1519 :

spdk_v26_01_migration/include/spdk/nvme_spec.h · lines 1519-1537 struct spdk_nvme_cpl — 16 bytes, the device's response

The status field is bit-packed. p is the phase tag — it flips between 0 and 1 every time the CQ wraps, which is how the host distinguishes "new completion" from "stale slot I already processed." This is the single most subtle thing in NVMe; you'll see it in the source walkthrough.

struct spdk_nvme_cpl {
    uint32_t cdw0;          /* command-specific result (never read data; data arrives by DMA) */
    uint32_t cdw1;          /* command-specific */
    uint16_t sqhd;          /* submission queue head pointer */
    uint16_t sqid;          /* submission queue identifier */
    uint16_t cid;           /* echoes the command's cid */
    union {
        uint16_t status_raw;
        struct spdk_nvme_status {
            uint16_t p    :  1;  /* phase tag */
            uint16_t sc   :  8;  /* status code */
            uint16_t sct  :  3;  /* status code type */
            uint16_t crd  :  2;  /* command retry delay */
            uint16_t m    :  1;  /* more */
            uint16_t dnr  :  1;  /* do not retry */
        } status;
    };
};
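
The bit positions in the status word can also be decoded with shifts and masks, which avoids depending on how a compiler orders bit-fields. The type and function names below are made up for this sketch, and it assumes the 16-bit word has already been read on a little-endian host.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative 16-byte completion entry with the status kept as a raw word. */
struct cqe_sketch {
    uint32_t cdw0, cdw1;
    uint16_t sqhd, sqid;
    uint16_t cid;
    uint16_t status_raw;
};
static_assert(sizeof(struct cqe_sketch) == 16, "CQ entry must be 16 bytes");

struct cpl_status_sketch {
    unsigned p, sc, sct, crd, m, dnr;
};

/* Split the raw status word into its fields, lowest bit first. */
static struct cpl_status_sketch cpl_status_decode(uint16_t raw)
{
    struct cpl_status_sketch s = {
        .p   =  raw        & 0x1u,   /* bit 0      phase tag */
        .sc  = (raw >> 1)  & 0xFFu,  /* bits 8:1   status code */
        .sct = (raw >> 9)  & 0x7u,   /* bits 11:9  status code type */
        .crd = (raw >> 12) & 0x3u,   /* bits 13:12 command retry delay */
        .m   = (raw >> 14) & 0x1u,   /* bit 14     more */
        .dnr = (raw >> 15) & 0x1u,   /* bit 15     do not retry */
    };
    return s;
}
```

A raw value of 0x0001 is a successful completion written with phase 1; any non-zero sc or sct means the command failed.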

Doorbell registers: how a single write kicks the device

The doorbell is the cheapest possible interface between the host and the device. The host writes a single 32-bit value to a specific MMIO address, and the device knows "the host has just placed one or more new commands in submission queue N — go look at slot T."

The "specific MMIO address" is calculated from the queue ID and the device's dstrd (doorbell stride) capability. In SPDK, that math is in lib/nvme/nvme_pcie_common.c:229 :

spdk_v26_01_migration/lib/nvme/nvme_pcie_common.c · lines 229-230 Doorbell address = base + (2*qid + 0 or 1) * stride

When SPDK creates a queue pair, it computes the doorbell addresses once and stashes them in the queue pair struct. Every submission is one MMIO write to sq_tdbl; every completion-processing pass is one MMIO write to cq_hdbl.

pqpair->sq_tdbl = pctrlr->doorbell_base + (2 * qpair->id + 0) * pctrlr->doorbell_stride_u32;
pqpair->cq_hdbl = pctrlr->doorbell_base + (2 * qpair->id + 1) * pctrlr->doorbell_stride_u32;

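Doorbell writes are posted PCIe writes (the CPU does not wait for a completion from the device), so they are cheap to issue and reach the device quickly, on the order of a few hundred nanoseconds on a modern PCIe link. The "ring the doorbell" idiom you'll see all over SPDK boils down to *sq_tdbl = tail; through a volatile pointer, preceded by a write barrier so that the command is visible in RAM before the device is told about it.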
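
The same address math can be written in byte offsets from the start of BAR0. In this sketch doorbell_offset is a hypothetical helper; it uses the spec's definition of the stride as 4 << DSTRD bytes, whereas SPDK's doorbell_stride_u32 (judging by its name) counts in 32-bit words so that it can be added directly to a uint32_t pointer.

```c
#include <assert.h>
#include <stdint.h>

/* Byte offset of a doorbell inside BAR0.
 * qid:   queue ID (0 is the admin queue)
 * is_cq: 0 for the SQ tail doorbell, non-zero for the CQ head doorbell
 * dstrd: CAP.DSTRD, a 4-bit field */
static uint64_t doorbell_offset(uint16_t qid, int is_cq, uint8_t dstrd)
{
    assert(dstrd <= 15);
    uint64_t stride = (uint64_t)4 << dstrd;
    return 0x1000u + (2u * (uint64_t)qid + (is_cq ? 1u : 0u)) * stride;
}
```

With DSTRD = 0 the doorbells are packed four bytes apart, so queue pair 1 uses 0x1008 and 0x100C. A larger stride spreads them out, which some controllers use to keep each doorbell on its own cache line or page.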

An NVMe command, byte by byte

A host submits work by writing one or more spdk_nvme_cmd entries into the submission queue, advancing the SQ tail, and ringing the doorbell. The device posts a completion by writing one spdk_nvme_cpl into the CQ, advancing its internal CQ tail, and (if interrupts are enabled) signaling an MSI-X vector. The host then consumes the entry and advances the CQ head.

1. Build cmd: fill in opcode, nsid, LBA, PRP/SGL
2. Copy to SQ: into slot (sq_tail % num_entries)
3. Advance tail: sq_tail = (sq_tail + 1) % num_entries
4. Ring doorbell: MMIO write of new tail to sq_tdbl
5. Device fetches: DMA-read of the cmd from RAM
6. Device executes: does the actual I/O (or admin op)
7. Device posts cpl: DMA-write of spdk_nvme_cpl to CQ
8. (Optional) IRQ: MSI-X if interrupts enabled
9. Host polls CQ (or gets woken by the IRQ)
10. Host reaps cpl: advance cq_head, ring cq_hdbl

The "ring the doorbell" line is the entire hot path. Everything else is RAM and DMA. That's why an NVMe SSD can sustain a million IOPS with a modern CPU doing nothing but touching cache lines.
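
The host side of the submit path (copy, advance, ring) fits in a few lines. This is a toy model under stated assumptions: struct toy_sq and toy_sq_submit are invented names, the ring has only four slots, the doorbell is an ordinary variable, and the write barrier a real driver needs before the doorbell store is only noted in a comment.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SQ_ENTRIES 4u

struct toy_sq {
    uint8_t  slots[SQ_ENTRIES][64]; /* would be DMA-addressable memory */
    uint16_t tail;                  /* owned by the host */
    uint16_t head;                  /* learned from the sqhd field of completions */
    volatile uint32_t *tdbl;        /* SQ tail doorbell */
};

/* Copy one 64-byte command into the ring and ring the doorbell.
 * Returns 0 on success, -1 if the ring is full. */
static int toy_sq_submit(struct toy_sq *sq, const void *cmd64)
{
    assert(sq != NULL && cmd64 != NULL);

    uint16_t next = (uint16_t)((sq->tail + 1u) % SQ_ENTRIES);
    if (next == sq->head) {
        return -1; /* one slot always stays empty */
    }

    memcpy(sq->slots[sq->tail], cmd64, 64); /* copy into the slot at the tail */
    sq->tail = next;                        /* advance and wrap */

    /* A real driver issues a write barrier here so the command is in RAM
     * before the device sees the new tail. */
    *sq->tdbl = sq->tail;                   /* ring the doorbell */
    return 0;
}
```

Note that the value written to the doorbell is the new tail index, not a count of commands. The device works out how many entries to fetch by comparing it with the tail it saw last time.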

Admin queue vs I/O queues

Every NVMe controller has exactly one admin queue pair (QID 0, capacity up to 4096 entries per the spec). It is created automatically when the controller is enabled — the host tells the device where the admin SQ and CQ live in RAM by writing asq/acq and setting cc.EN = 1.

The admin queue handles housekeeping:

Identify — query the controller's identity, capabilities, namespace list.
Create I/O Completion Queue / Submission Queue — set up the data-plane queues.
Delete I/O CQ / SQ — tear them down.
Set Features / Get Features — toggle things like number of queues, interrupt coalescing, arbitration.
Format NVM, Firmware Commit, Sanitize — admin-level operations that affect the whole device.
Get Log Page, Async Event Request — telemetry and asynchronous event delivery.

I/O queues (QID 1..N) handle the actual data movement: read, write, flush, write-zeroes, dataset management, compare, vendor-specific commands. The host negotiates how many it may create with Set Features (Number of Queues), picks a depth for each up to the device's MQES capability, and creates them via the admin queue. After that, the admin queue stays off the data path: it is still used for housekeeping such as log pages and asynchronous events, but never for reads and writes.
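
Both limits are encoded as "value minus one", which is a common source of off-by-one bugs. The helpers below are a sketch with invented names. The bit positions follow the NVMe specification: Number of Queues puts the requested SQ count in cdw11 bits 15:0 and the CQ count in bits 31:16, the completion's cdw0 returns the granted counts in the same positions, and MQES is bits 15:0 of cap.

```c
#include <assert.h>
#include <stdint.h>

/* cdw11 for Set Features / Number of Queues. Counts are 1-based here and
 * 0-based on the wire. */
static uint32_t num_queues_cdw11(uint16_t nsq, uint16_t ncq)
{
    assert(nsq >= 1 && ncq >= 1);
    return ((uint32_t)(ncq - 1) << 16) | (uint32_t)(nsq - 1);
}

/* Decode the granted counts from the completion's cdw0. */
static void num_queues_granted(uint32_t cdw0, uint32_t *nsq, uint32_t *ncq)
{
    assert(nsq != 0 && ncq != 0);
    *nsq = (cdw0 & 0xFFFFu) + 1u;
    *ncq = (cdw0 >> 16) + 1u;
}

/* Largest queue depth the controller accepts, from CAP.MQES. */
static uint32_t cap_max_entries(uint64_t cap)
{
    return (uint32_t)(cap & 0xFFFFu) + 1u;
}
```

The controller may grant fewer queues than requested, so the host has to use the decoded values, not the ones it asked for.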

MSI-X interrupts — and why SPDK turns them off

A modern NVMe device can signal the host that work is complete by firing an MSI-X interrupt, a message-signaled interrupt routed over PCIe to a specific CPU. The OS handles the interrupt by running a registered handler, which (in the kernel's case) is the NVMe driver's interrupt handler.

SPDK's default mode is no interrupts. The user can turn them on (e.g. --enable-interrupts for spdk_nvme_perf), but the default is polling. Why?

Interrupt latency. On a busy system, the time from "device fires MSI-X" to "host code runs" is typically 5–50 µs. Polling, by contrast, can react in <1 µs.
Tail latency. Under load, the worst-case interrupt coalescing delay (the device waits a few hundred µs hoping to batch completions) makes p99.9 / p99.99 latencies bursty. Polling eliminates coalescing.
OS noise. An interrupt handler that runs in the kernel can be preempted, can be delayed by softirqs, can be migrated to another CPU. Polling is bound to one thread on one core, with nothing in between.

The next page in this primer (0.3) goes deep on this trade-off. For now, the rule is: SPDK has interrupts available, see spdk_pci_device_disable_interrupts at lib/nvme/nvme_pcie.c:1049 and spdk_nvme_poll_group_set_interrupt_callback at lib/nvme/nvme_poll_group.c:87, but the fast path doesn't use them.

DMA, PRPs, and SGLs: getting data into RAM

The device moves data with PCIe DMA — it can read from and write to system RAM directly, as long as the host has set up the IOMMU and given the device a bus address to use. There is no "send me a buffer" control command; the host just hands the device a list of physical addresses in the submission queue entry, and the device DMAs to/from them.

NVMe supports two ways to describe data buffers:

Mechanism	What it is	When you'd use it
PRP (Physical Region Page)	A list of physical page addresses. The page size is set in cc and is typically 4 KiB. Only the first entry may start at an offset inside its page.	Default for reads and writes on PCIe. PRP1 + PRP2 cover up to two pages directly; for larger I/O, PRP2 points to a PRP list.
SGL (Scatter-Gather List)	A flexible descriptor chain. Descriptor types include data block, bit bucket, segment, last segment, and keyed data block (the key names a remote memory region; it is not an encryption key).	Buffers that are scattered in pieces of arbitrary size and alignment, and NVMe over Fabrics, which describes all data transfers with SGLs.

The psdt field in the SQ entry tells the device which one to interpret dptr as. The choice is per-command. A PCIe controller must support PRPs; SGL support for I/O commands is optional and is advertised in the Identify Controller data. Most reads and writes use PRP.
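
The PRP rules (one page, two pages, or a list) can be sketched as follows. This is a simplification under loud assumptions: prp_build is an invented helper, the buffer is treated as physically contiguous so that page addresses can be computed by addition, the page size is fixed at 4 KiB, and list chaining for very large transfers is not handled. A real driver translates every page separately.

```c
#include <assert.h>
#include <stdint.h>

#define PRP_PAGE 4096u

/* Fill prp1/prp2 for `len` bytes at "physical" address `addr`.
 * For more than two pages, page addresses 2..N go into list[] and prp2 is
 * set to list_phys, the address of that list as the device sees it.
 * Returns the number of pages touched, or 0 if list[] is too small. */
static unsigned prp_build(uint64_t addr, uint32_t len,
                          uint64_t *prp1, uint64_t *prp2,
                          uint64_t *list, unsigned list_cap, uint64_t list_phys)
{
    assert(len > 0);
    const uint64_t mask = ~(uint64_t)(PRP_PAGE - 1);
    uint64_t first = addr & mask;
    uint64_t last  = (addr + len - 1) & mask;
    unsigned pages = (unsigned)((last - first) / PRP_PAGE) + 1;

    *prp1 = addr; /* may carry an offset into the first page */
    if (pages == 1) {
        *prp2 = 0;
    } else if (pages == 2) {
        *prp2 = first + PRP_PAGE; /* second page, always page-aligned */
    } else {
        if (pages - 1 > list_cap) {
            return 0;
        }
        for (unsigned i = 1; i < pages; i++) {
            list[i - 1] = first + (uint64_t)i * PRP_PAGE;
        }
        *prp2 = list_phys;
    }
    return pages;
}
```

The interesting case is an unaligned 4 KiB buffer: it straddles two pages, so it needs PRP2 even though it is only one page long.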

Namespaces — one SSD, many "disks"

An NVMe controller may expose one or more namespaces (NSID 1..N). A namespace is a range of logical blocks with a single LBA format. A single physical SSD can present:

One namespace covering the whole device (typical consumer drives).
Many namespaces, each a slice of the device, with independent LBA formats and block sizes (typical enterprise drives; ZNS SSDs).
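No namespaces at all: a controller can come up with none attached. The host discovers what exists by issuing Identify commands (the active namespace list first, then Identify Namespace for each entry).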

From the host's point of view, each namespace is independent: a read to NSID 5 doesn't interact with a read to NSID 7. The I/O command's nsid field selects which namespace the operation targets.

In SPDK, each namespace is a struct spdk_nvme_ns, and you call spdk_nvme_ns_cmd_read(ns, qpair, buf, lba, lba_count, cb_fn, cb_arg, io_flags). Underneath, that helper populates the SQ entry's nsid with the namespace's ID, puts the 64-bit starting LBA in cdw10 (low half) and cdw11 (high half), puts the block count minus one in the low 16 bits of cdw12, and hands the request to the transport, which rings the doorbell.
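
The dword packing for a read looks like this in isolation. The struct and function names are illustrative, not SPDK's; the field positions follow the NVM command set, where the block count is stored 0-based in 16 bits.

```c
#include <assert.h>
#include <stdint.h>

struct rw_dwords_sketch {
    uint32_t cdw10; /* starting LBA, bits 31:0 */
    uint32_t cdw11; /* starting LBA, bits 63:32 */
    uint32_t cdw12; /* bits 15:0 = number of blocks minus one */
};

/* Pack a starting LBA and a block count (1..65536) into cdw10-12. */
static struct rw_dwords_sketch nvme_rw_dwords(uint64_t slba, uint32_t nblocks)
{
    assert(nblocks >= 1 && nblocks <= 65536);
    struct rw_dwords_sketch d = {
        .cdw10 = (uint32_t)(slba & 0xFFFFFFFFu),
        .cdw11 = (uint32_t)(slba >> 32),
        .cdw12 = nblocks - 1u,
    };
    return d;
}
```

Because the count is 0-based, a value of 0 in cdw12 means "one block", and a single command can cover at most 65,536 blocks (subject to the controller's maximum transfer size).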

Source walkthrough: how SPDK actually rings a doorbell

Here is the part of the data path that turns a spdk_nvme_ns_cmd_read call into an MMIO write. It is small enough to read in one sitting, and it captures all the ideas from this page:

spdk_v26_01_migration/lib/nvme/nvme_pcie_common.c · lines 658-702 nvme_pcie_qpair_submit_tracker — the moment a request goes on the wire

By the time this function is called, a tracker (tr) has already been filled with a fully-formed spdk_nvme_cmd and the request (req) is ready to go. Three things happen:

void
nvme_pcie_qpair_submit_tracker(struct spdk_nvme_qpair *qpair, struct nvme_tracker *tr)
{
    struct nvme_request	*req;
    struct nvme_pcie_qpair	*pqpair = nvme_pcie_qpair(qpair);
    struct spdk_nvme_ctrlr	*ctrlr = qpair->ctrlr;

    req = tr->req;
    assert(req != NULL);

    spdk_trace_record(TRACE_NVME_PCIE_SUBMIT, qpair->id, 0,
                      (uintptr_t)req, req->cb_arg,
                      (uint32_t)req->cmd.cid, (uint32_t)req->cmd.opc,
                      req->cmd.cdw10, req->cmd.cdw11, req->cmd.cdw12,
                      pqpair->qpair.queue_depth);

    if (req->cmd.fuse) {
        qpair->last_fuse = req->cmd.fuse;
    }

    /* Copy the command from the tracker to the submission queue. */
    nvme_pcie_copy_command(&pqpair->cmd[pqpair->sq_tail], &req->cmd);

    if (spdk_unlikely(++pqpair->sq_tail == pqpair->num_entries)) {
        pqpair->sq_tail = 0;
    }

    if (spdk_unlikely(pqpair->sq_tail == pqpair->sq_head)) {
        NVME_QPAIR_ERRLOG(qpair, "sq_tail is passing sq_head!\n");
    }

    if (!pqpair->flags.delay_cmd_submit) {
        nvme_pcie_qpair_ring_sq_doorbell(qpair);
    }
}

What you should notice:

Line ~688: the 64-byte command is copied (with streaming stores, for performance) into the SQ slot at sq_tail. The slot is a regular RAM location in pqpair->cmd[], which is DMA-addressable (it was allocated with spdk_zmalloc(... SPDK_MALLOC_DMA) earlier in lib/nvme/nvme_pcie_common.c:186).
Lines ~691–693: sq_tail advances and wraps to 0 at num_entries. A queue can hold up to 64 Ki entries (bounded by the device's MQES), and one slot always stays empty so that a full ring can be told apart from an empty one. The device only ever learns sq_tail via the doorbell.
Lines ~695–697: defensive check. If sq_tail catches up to sq_head, the ring has overrun and the newest command has clobbered one the device has not fetched yet. The error log fires; this should never happen in production because callers check queue_depth before submitting.
Lines ~699–701: unless delay_cmd_submit is on (used for batching), the SQ doorbell is rung: a single 32-bit MMIO write to pqpair->sq_tdbl. That is the entire hot path. One store. To a memory address. The device picks it up from there.

Edge cases & what trips people up

1. What if you submit to a full queue?

The queue is "full" when (sq_tail + 1) % num_entries == sq_head. In SPDK, occupancy is tracked by pqpair->qpair.queue_depth, and the public API returns -ENOMEM when it cannot accept another request. The device cannot refuse a doorbell write. If the host overruns sq_head, it overwrites commands the device has not fetched yet and the outcome is undefined; at best the controller reports the bad doorbell value through an asynchronous event. The host has to police this, and SPDK does, via trackers.

2. What if the completion comes before you ring the doorbell?

It can't. The device only posts a completion after it has fetched the command, which it only does after the doorbell. But completions can arrive in any order relative to other submissions on the same qpair, because the device may reorder them internally for performance. The host matches command to completion by the cid field, not by order.
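
Matching by cid usually means a table of in-flight commands indexed by cid. The sketch below uses invented names (toy_tracker, tracker_alloc, tracker_complete, record_status) and a table of eight entries; it shows the idea, not SPDK's tracker implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_OUTSTANDING 8u

typedef void (*done_fn)(void *arg, uint16_t status);

struct toy_tracker {
    done_fn cb;
    void   *arg;
    int     active;
};

/* Claim a free tracker. Its index is used as the command's cid.
 * Returns the cid, or -1 if every tracker is in flight. */
static int tracker_alloc(struct toy_tracker *t, done_fn cb, void *arg)
{
    assert(cb != NULL);
    for (unsigned i = 0; i < MAX_OUTSTANDING; i++) {
        if (!t[i].active) {
            t[i] = (struct toy_tracker){ .cb = cb, .arg = arg, .active = 1 };
            return (int)i;
        }
    }
    return -1;
}

/* Called for each completion: look up by cid, free the slot, run the callback.
 * Returns -1 for a cid that is out of range or not in flight. */
static int tracker_complete(struct toy_tracker *t, uint16_t cid, uint16_t status)
{
    if (cid >= MAX_OUTSTANDING || !t[cid].active) {
        return -1;
    }
    t[cid].active = 0;
    t[cid].cb(t[cid].arg, status);
    return 0;
}

/* Example callback: store the status where the submitter can see it. */
static void record_status(void *arg, uint16_t status)
{
    *(uint16_t *)arg = status;
}
```

Because lookup is by index, completion order does not matter, and a completion for a cid that is not in flight is caught instead of running a stale callback.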

3. The phase tag (the single most subtle thing in NVMe)

The CQ is a circular buffer. The host advances cq_head to consume completions. The device advances cq_tail to post new ones. Each pointer wraps back to slot 0 on its own when it reaches the end, and the device never tells the host where its tail is. Without a flag, the host can't tell "new completion" from "stale slot."

The p bit in the completion's status word is the flag. The host zeroes the CQ when it creates it. The device writes p = 1 on its first pass around the queue, p = 0 on the second, and keeps alternating. The host keeps an expected phase that starts at 1 and flips every time cq_head wraps. If the entry at cq_head carries the expected p, it is a fresh completion. If it carries the opposite value, it is left over from the previous pass and the host stops polling.
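
A four-slot simulation makes the phase logic easier to follow. Everything here is a toy: the names are invented, the "device" is a function that writes into the same struct, and it does not model the device waiting for the host to free slots.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CQ_ENTRIES 4u

struct toy_cqe {
    uint16_t cid;
    uint16_t status; /* bit 0 is the phase tag */
};

struct toy_cq {
    struct toy_cqe slots[CQ_ENTRIES];
    uint16_t head;      /* host side */
    uint8_t  phase;     /* phase the host expects next */
    uint16_t dev_tail;  /* simulated device side */
    uint8_t  dev_phase;
};

static void toy_cq_init(struct toy_cq *cq)
{
    memset(cq, 0, sizeof(*cq)); /* every slot starts with p = 0 */
    cq->phase = 1;
    cq->dev_phase = 1;          /* first pass is written with p = 1 */
}

/* Simulated device: post one completion and flip phase on wrap. */
static void toy_device_post(struct toy_cq *cq, uint16_t cid)
{
    cq->slots[cq->dev_tail].cid = cid;
    cq->slots[cq->dev_tail].status = cq->dev_phase;
    if (++cq->dev_tail == CQ_ENTRIES) {
        cq->dev_tail = 0;
        cq->dev_phase ^= 1u;
    }
}

/* Host: reap entries while the phase matches. Returns how many were reaped. */
static unsigned toy_cq_poll(struct toy_cq *cq, uint16_t *out, unsigned max)
{
    assert(out != NULL);
    unsigned n = 0;
    while (n < max && (uint8_t)(cq->slots[cq->head].status & 1u) == cq->phase) {
        out[n++] = cq->slots[cq->head].cid;
        if (++cq->head == CQ_ENTRIES) {
            cq->head = 0;
            cq->phase ^= 1u;
        }
    }
    return n; /* a real driver would now write cq->head to cq_hdbl */
}
```

After the ring wraps, slot 1 still holds a first-pass entry with p = 1 while the host now expects p = 0, so polling stops there without any shared tail pointer.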

4. Hot-remove while I/O is in flight

A PCIe NVMe device can be unplugged at any time. The next MMIO access to BAR0 — including a doorbell write — would normally raise a SIGBUS and kill the process. SPDK installs a SIGBUS handler that detects this case and remaps BAR0 to a placeholder region. After that, all I/O completes with an error (rather than crashing the process). The flow is documented in doc/nvme.md §Hotplug; the implementation lives in the PCIe transport, around the BAR-mapping code in lib/nvme/nvme_pcie.c:770 .

5. The admin queue is QID 0 — forever

You can delete I/O queues, but the admin queue is permanent. It is set up at cc.EN = 1 time and torn down at controller reset. The Delete I/O Submission Queue and Delete I/O Completion Queue commands apply only to I/O queues: a spec-compliant controller fails a request that names QID 0 with an Invalid Queue Identifier status. The only way to tear down the admin queue is a controller reset.

6. The NVMe spec is the spec, not the device

A given device may support only a subset of the spec, may interpret fields in a vendor-specific way, and may have quirks. SPDK maintains a quirks database in lib/nvme/nvme_quirks.c and applies it during controller construction. The spec tells you what can happen; the device tells you what does happen. Trust the spec for protocol, but always check device-specific behavior before assuming it works.

Why it matters

When you read later explainers (reactor, bdev, lvol, nvmf) you will see words like qpair, tracker, doorbell, PRP, namespace, phase tag. They are not abstractions on top of NVMe; they are NVMe. Every bdev I/O that ends up on a local NVMe SSD eventually goes through nvme_pcie_qpair_submit_tracker (or a fabric-transport cousin of it). For NVMe-backed bdevs, the framework is a way to multiplex many logical "disks" onto a small set of physical NVMe queue pairs, one per CPU core, all submitting by ringing doorbells.

When you see a hung submission, you know what to check: the SQ tail, the doorbell, the device's CQ head, the phase tag. When you see a completion you don't expect, you know what to check: the cpl's cid, the sc/sct fields, the queue's queue_depth. The hardware is no longer a black box.