Layer 6 · nvmf

The four objects, and the protocol that connects them.

NVMe-over-Fabrics takes the NVMe command set you already know from Layer 0.2 and tunnels it over a network. The protocol adds four new nouns that don't exist in PCIe-attached NVMe: subsystem, namespace, listener, and controller. This page is the vocabulary you'll need for the rest of Layer 6.

~15 min read · 2 diagrams · prerequisites: 0.2 · 4.1
On this page
  1. What NVMe-oF actually does
  2. Initiator and target, end to end
  3. Subsystem — the "SCSI target" of NVMe-oF
  4. Namespace — one bdev, one NSID
  5. Listener — the network endpoint on the target
  6. Controller — the initiator-side state machine
  7. Discovery service — how initiators find targets
  8. Host NQN — the iSCSI IQN equivalent
  9. Admin queue vs I/O queues, across the network
  10. Edge cases & what trips people up

What NVMe-oF actually does

PCIe-attached NVMe puts the NVMe controller and the host's NVMe driver on the same machine, talking over a PCIe bus. Commands travel in a submission queue, completions come back in a completion queue, and that's it. NVMe-over-Fabrics (NVMe-oF for short) replaces that PCIe bus with a network. The host ("initiator") and the storage device ("target") are now two different processes, often on two different hosts, talking TCP or RDMA. The NVMe command set and the submission/completion queue abstraction are unchanged. The memory-mapped register interface does not survive the trip as-is: its registers become properties that the host reads and writes with Property Get and Property Set commands. Beyond that, what changes is how the bytes get from one side to the other.

The spec that defines this is the NVMe-oF specification, TP 8013, maintained by NVM Express, Inc. It is layered on top of the base NVMe spec. Every transport (RDMA, TCP, FC, VFIO-user) speaks the same wire-level command capsules and the same connection setup; only the framing differs. SPDK implements the target side in lib/nvmf/.

Initiator and target, end to end

The protocol has exactly two roles, named the same way iSCSI names them:

| Role | What it does | SPDK component | Linux counterpart |
|---|---|---|---|
| Target | Owns the bdevs. Serves them to the network. Accepts connections. | nvmf_tgt binary, the SPDK lib/nvmf/ library | None. There's no in-kernel NVMe-oF target worth using; SPDK is the target. |
| Initiator | Connects to targets. Discovers them. Submits NVMe commands on behalf of an application. | (none in this stack; see Layer 0) | Kernel nvme_tcp or nvme_rdma driver; the nvme-cli userland tool |

The asymmetry is the thing to internalize: in this stack, SPDK only ever plays the target role. Nothing we build connects out to an NVMe-oF target. SPDK does also ship a userspace NVMe driver (lib/nvme) that talks to local PCIe NVMe devices directly, bypassing the in-kernel driver, and that same driver can act as an NVMe-oF initiator. The targets we build with diskengine and the SPDK nvmf_tgt are consumed either by the kernel nvme_tcp/nvme_rdma driver or by another SPDK process using that userspace driver.

flowchart LR
subgraph "Initiator host"
  APP[Application<br/>issues read/write syscalls]
  DRV[Kernel NVMe-oF driver<br/>nvme_tcp or nvme_rdma]
  APP --> DRV
end
subgraph "Target host (SPDK)"
  NVMF[nvmf_tgt<br/>the SPDK target]
  BDEV[bdev stack<br/>malloc / nvme / lvol / passthru]
  NVMF --> BDEV
end
DRV -- "RDMA verbs or TCP/TLS" --> NVMF

fig. 1   The protocol is two-sided. The kernel driver on the left submits NVMe capsules over RDMA or TCP. The nvmf_tgt process on the right translates them into bdev I/O. The application sees a normal /dev/nvmeXnY device and never knows the storage is on another machine.

Subsystem — the "SCSI target" of NVMe-oF

A subsystem is the unit of export. It is a named collection of namespaces, with a list of allowed listeners and a list of allowed hosts. If you're coming from iSCSI, think of it as the iSCSI target: the thing you log in to.

In SPDK, the subsystem is a C struct with state, an NQN, a list of namespaces, a list of listeners, a list of hosts, and a list of currently connected controllers. Creation goes through spdk_nvmf_subsystem_create — the function that lives behind the JSON-RPC method nvmf_create_subsystem.

  1. subnqn — the subsystem's NVMe Qualified Name. The string the initiator uses to identify which subsystem it wants to connect to. The format is modeled on the iSCSI IQN: nqn.yyyy-mm.reverse-domain:identifier, at most 223 bytes long. The well-known discovery NQN is nqn.2014-08.org.nvmexpress.discovery.

  2. state — one of INACTIVE, ACTIVE, or PAUSED, plus transitional states such as PAUSING and RESUMING. You can only mutate a subsystem in INACTIVE or PAUSED. This is what makes namespace changes safe: you pause the subsystem, change it, resume it. See the spdk_nvmf_subsystem_add_ns_ext precondition check at lib/nvmf/subsystem.c:2243.

  3. ctrlrs — every controller currently associated with this subsystem. Each connecting initiator gets one controller. When the last qpair for a controller goes away, the controller is destroyed. See lib/nvmf/nvmf.c:1442 for the "last qpair tears down the ctrlr" path.

  4. ns — an array of namespace pointers, indexed by NSID. One per NSID, up to max_nsid. Entries are populated by spdk_nvmf_subsystem_add_ns_ext at lib/nvmf/subsystem.c:2229.

  5. next_cntlid — the controller ID to hand out next. Each connecting initiator gets a unique 16-bit ID, allocated from the min_cntlid..max_cntlid range.
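The pause, change, resume rule from the state field can be modeled in a few lines. This is a toy Python sketch, not SPDK code: ToySubsystem, its methods, and the example NQN and bdev names are hypothetical, and the model ignores any transitional states a real implementation has.

```python
from enum import Enum


class State(Enum):
    INACTIVE = "inactive"
    ACTIVE = "active"
    PAUSED = "paused"


class ToySubsystem:
    """Toy model of the mutate-only-when-quiesced rule (not SPDK's struct)."""

    def __init__(self, subnqn):
        self.subnqn = subnqn
        self.state = State.INACTIVE
        self.namespaces = {}  # NSID -> bdev name

    def start(self):
        self.state = State.ACTIVE

    def pause(self):
        if self.state is not State.ACTIVE:
            raise RuntimeError("only an ACTIVE subsystem can be paused")
        self.state = State.PAUSED

    def resume(self):
        if self.state is not State.PAUSED:
            raise RuntimeError("only a PAUSED subsystem can be resumed")
        self.state = State.ACTIVE

    def add_ns(self, nsid, bdev_name):
        # Same shape as the precondition described above: refuse while ACTIVE.
        if self.state not in (State.INACTIVE, State.PAUSED):
            raise RuntimeError("subsystem must be INACTIVE or PAUSED")
        self.namespaces[nsid] = bdev_name


subsys = ToySubsystem("nqn.2016-06.io.spdk:cnode1")
subsys.add_ns(1, "Malloc0")      # allowed: still INACTIVE
subsys.start()
try:
    subsys.add_ns(2, "Malloc1")  # rejected: subsystem is ACTIVE
except RuntimeError as err:
    print("rejected:", err)
subsys.pause()
subsys.add_ns(2, "Malloc1")      # allowed: PAUSED
subsys.resume()
print(sorted(subsys.namespaces.items()))
```

The point of the gate is that no I/O path can observe a half-updated namespace array, because changes only happen while the subsystem is quiesced.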

Namespace — one bdev, one NSID

A namespace is what NVMe-oF calls the exposed block device. The unit of exposure is one bdev. When you call spdk_nvmf_subsystem_add_ns_ext, the subsystem opens the named bdev read-write with spdk_bdev_open_ext_v2 and stores the resulting spdk_bdev_desc inside an spdk_nvmf_ns. The NSID is allocated from a free slot in the subsystem's namespace array.

The mapping is one-to-one. Two namespaces cannot point to the same bdev, and one namespace cannot expose two bdevs. If you want RAID or concatenation, you do it in the bdev layer (an lvol sits on top of an lvstore; a vbdev_passthru is one-to-one). diskengine's wiring is at NvmfSubsystemAddNs:114.
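A small model makes the slot allocation and the one-to-one rule concrete. This is a hedged sketch, not SPDK's implementation: ToyNamespaceTable is a hypothetical name, and treating nsid=0 as "pick one for me" and choosing the lowest free slot are assumptions of the model.

```python
class ToyNamespaceTable:
    """Toy NSID allocator: slot i holds NSID i + 1 (NSID 0 is never valid)."""

    def __init__(self, max_nsid):
        self.slots = [None] * max_nsid

    def add_ns(self, bdev_name, nsid=0):
        # Enforce the one-to-one mapping: a bdev backs at most one namespace.
        if bdev_name in self.slots:
            raise ValueError(f"{bdev_name} is already exported")
        if nsid == 0:
            # Auto-assign: take the lowest free slot.
            if None not in self.slots:
                raise ValueError("no free NSID")
            nsid = self.slots.index(None) + 1
        elif not 1 <= nsid <= len(self.slots):
            raise ValueError("NSID out of range")
        elif self.slots[nsid - 1] is not None:
            raise ValueError(f"NSID {nsid} is in use")
        self.slots[nsid - 1] = bdev_name
        return nsid

    def remove_ns(self, nsid):
        self.slots[nsid - 1] = None


table = ToyNamespaceTable(max_nsid=4)
a = table.add_ns("Malloc0")
b = table.add_ns("Malloc1")
table.remove_ns(a)
c = table.add_ns("Malloc2")   # reuses the slot freed by Malloc0
print(a, b, c)                # 1 2 1
```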

Listener — the network endpoint on the target

A listener is the address a target listens on. It's a transport ID: trtype (RDMA / TCP / VFIOUSER / FC), adrfam (IPv4 / IPv6), traddr (the IP, or for FC, the WWN), and trsvcid (the port number, or for FC, the route).
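The four fields are often written as one string of key:value pairs. The sketch below parses such a string into a small record. It is an illustration only: TransportId and parse_trid are hypothetical names, and the exact string syntax accepted by SPDK's own parser is assumed here rather than quoted from it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransportId:
    trtype: str
    adrfam: str
    traddr: str
    trsvcid: str


def parse_trid(text):
    """Parse whitespace-separated 'key:value' tokens into a TransportId."""
    fields = {}
    for token in text.split():
        # Split on the first colon only, so IPv6 addresses survive intact.
        key, sep, value = token.partition(":")
        if not sep or not value:
            raise ValueError(f"malformed token: {token!r}")
        fields[key.lower()] = value
    missing = {"trtype", "adrfam", "traddr", "trsvcid"} - fields.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return TransportId(
        trtype=fields["trtype"].upper(),
        adrfam=fields["adrfam"],
        traddr=fields["traddr"],
        trsvcid=fields["trsvcid"],
    )


trid = parse_trid("trtype:tcp adrfam:IPv4 traddr:10.0.0.1 trsvcid:4420")
print(trid)
```

Making the record frozen means two listeners with the same four fields compare equal and can be used as dictionary keys, which is convenient when several subsystems share one address.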

Listeners belong to a subsystem, not to a target. Two subsystems can each have a listener on the same target, even on the same transport address — they're tagged by NQN, not by address. The acceptor poller is at the target level; the per-subsystem filter happens when a connection is established.

The function that adds a listener is short, but the work underneath is the meat. A listener is added in three phases: the subsystem is paused (so no new connections arrive), the transport actually starts listening on the trid, and the subsystem is resumed. The transport's listen_associate hook is where the work gets transport-specific: for RDMA it opens a listening CM ID; for TCP it opens a socket; for VFIO-user it sets up a vfio-user server endpoint.

The opts argument is transport-specific: secure channel (TLS) for TCP, ANA state, the socket implementation selector, etc. The header is at include/spdk/nvmf.h:908.

Controller — the initiator-side state machine

A controller is the NVMe-oF equivalent of an "I_T nexus" in SCSI. It's the runtime state for one connecting initiator on one subsystem: an admin queue, a set of I/O qpairs, the negotiated limits, the controller ID, the host NQN, and the keep-alive timer.

A controller is created by the Fabrics Connect command. The flow is: initiator sends a Connect capsule on the admin qpair, the target's nvmf_ctrlr_create at lib/nvmf/ctrlr.c:438 builds a fresh spdk_nvmf_ctrlr, allocates a controller ID, sets up the admin queue, and replies with a Connect response.

  1. dynamic_ctrlr — true for fabric transports (RDMA/TCP), false for direct-attach transports (PCIe, VFIO-user). For fabric transports, the controller ID is allocated by the target at connect time. For non-fabric transports, the controller ID comes from the host's cntlid field.

  2. thread — the SPDK thread the controller lives on. Always the qpair's poll group thread. This is the single constraint that keeps the controller lock-free: every operation on a controller runs on its thread. Cross-thread operations need a spdk_thread_send_msg.

  3. qpair_mask — a bit per qpair, used to find the last qpair when a controller is tearing down. When the last bit clears, the controller is destroyed. See lib/nvmf/nvmf.c:1442 for the "last qpair, free the ctrlr" path.
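The qpair_mask bookkeeping is ordinary bit manipulation. Here is a minimal sketch of the idea, with a hypothetical ToyController class standing in for the real struct; it shows only the "last bit clears, controller goes away" rule and nothing about threads or transports.

```python
class ToyController:
    """Toy model of qpair_mask bookkeeping (not SPDK's spdk_nvmf_ctrlr)."""

    def __init__(self, cntlid):
        self.cntlid = cntlid
        self.qpair_mask = 0      # bit N set <=> the qpair with QID N is attached
        self.destroyed = False

    def attach_qpair(self, qid):
        if self.qpair_mask & (1 << qid):
            raise ValueError(f"QID {qid} already attached")
        self.qpair_mask |= 1 << qid

    def detach_qpair(self, qid):
        self.qpair_mask &= ~(1 << qid)
        if self.qpair_mask == 0:
            # Last qpair gone: nothing left for the controller to serve.
            self.destroyed = True


ctrlr = ToyController(cntlid=1)
for qid in (0, 1, 2):            # QID 0 is the admin queue
    ctrlr.attach_qpair(qid)
ctrlr.detach_qpair(1)
ctrlr.detach_qpair(0)
print(bin(ctrlr.qpair_mask), ctrlr.destroyed)   # 0b100 False
ctrlr.detach_qpair(2)
print(bin(ctrlr.qpair_mask), ctrlr.destroyed)   # 0b0 True
```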

Discovery service — how initiators find targets

An initiator needs to know what subsystems exist before it can connect. NVMe-oF handles this with a special pre-defined subsystem called the discovery subsystem. Its NQN is fixed: nqn.2014-08.org.nvmexpress.discovery.

Connect to the discovery subsystem and send Get Log Page with LID 0x70 (0x01 is the Error Information log, not discovery). The target returns a discovery log page: a list of all the subsystems it knows about, with their NQN, transport type, and the addresses where they can be reached.

Note that the discovery subsystem is created with allow_any_host = true. Discovery is meant to be open to all initiators; otherwise how would you find anything? The discovery subsystem also can't have namespaces (the num_ns argument is 0, and the create function at lib/nvmf/subsystem.c:240 enforces this).

sequenceDiagram
participant Init as Initiator (kernel driver)
participant DiscTgt as Target's discovery svc
participant RegTgt as Target's regular subsystem

Init->>DiscTgt: nvme connect (subnqn = nqn.2014-08...discovery, traddr = 10.0.0.1, RDMA)
DiscTgt-->>Init: connect OK
Init->>DiscTgt: Get Log Page 0x70
DiscTgt-->>Init: discovery log: [nqn-A on 10.0.0.1:4420 RDMA, nqn-B on 10.0.0.1:4421 RDMA]
Init->>RegTgt: nvme connect (subnqn = nqn-A, traddr = 10.0.0.1, RDMA)
RegTgt-->>Init: connect OK, ctrlr created

fig. 2   The initiator first connects to the discovery service (the fixed NQN), pulls a list of available subsystems, then connects to whichever one it actually wants. The discovery controller has no namespaces; it only serves the log page.
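The exchange in fig. 2 boils down to "flatten the target's configuration into a list, then search it". The sketch below models both sides under stated assumptions: the SUBSYSTEMS table, discovery_log, and pick_path are hypothetical, and a real discovery log entry carries more fields than the four shown.

```python
# Hypothetical target configuration: subsystem NQN -> list of listeners.
SUBSYSTEMS = {
    "nqn-A": [("RDMA", "10.0.0.1", "4420")],
    "nqn-B": [("RDMA", "10.0.0.1", "4421")],
}


def discovery_log(subsystems):
    """Target side: one entry per (subsystem, listener) pair."""
    entries = []
    for subnqn, listeners in sorted(subsystems.items()):
        for trtype, traddr, trsvcid in listeners:
            entries.append(
                {"subnqn": subnqn, "trtype": trtype,
                 "traddr": traddr, "trsvcid": trsvcid}
            )
    return entries


def pick_path(entries, wanted_nqn):
    """Initiator side: find where the wanted subsystem can be reached."""
    for entry in entries:
        if entry["subnqn"] == wanted_nqn:
            return entry
    raise LookupError(f"{wanted_nqn} is not in the discovery log")


log = discovery_log(SUBSYSTEMS)
target = pick_path(log, "nqn-A")
print(f"connect to {target['subnqn']} at {target['traddr']}:{target['trsvcid']}")
```

A subsystem with two listeners would produce two entries with the same NQN, which is how an initiator learns about multiple paths.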

Host NQN — the iSCSI IQN equivalent

Every initiator has an identifier, just like every iSCSI initiator has an IQN. It's called the Host NQN and it goes into the connect_data at connect time. The target uses it to decide whether the initiator is allowed to talk to the subsystem.

By default, a subsystem is locked to a list of allowed Host NQNs, which you populate with spdk_nvmf_subsystem_add_host. Only initiators in the list can connect, unless the subsystem itself has allow_any_host = true. The well-known discovery NQN is normally open to everyone; production I/O subsystems should be locked down.
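The access decision itself is a set lookup with one override. A minimal sketch, assuming a hypothetical ToyHostAcl class and made-up example Host NQNs:

```python
class ToyHostAcl:
    """Toy per-subsystem host allow list (not SPDK's data structure)."""

    def __init__(self, allow_any_host=False):
        self.allow_any_host = allow_any_host
        self.hosts = set()

    def add_host(self, hostnqn):
        self.hosts.add(hostnqn)

    def host_allowed(self, hostnqn):
        # The flag short-circuits the list; otherwise the NQN must be listed.
        return self.allow_any_host or hostnqn in self.hosts


HOST_A = "nqn.2014-08.com.example:host-a"   # example values, not real hosts
HOST_B = "nqn.2014-08.com.example:host-b"

locked = ToyHostAcl()
locked.add_host(HOST_A)
print(locked.host_allowed(HOST_A))   # True
print(locked.host_allowed(HOST_B))   # False

open_to_all = ToyHostAcl(allow_any_host=True)
print(open_to_all.host_allowed(HOST_B))   # True
```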

The Host NQN lives in /etc/nvme/hostnqn on a Linux initiator. It's persistent across reboots and uniquely identifies a single physical machine. For a VM, it's whatever was set when the VM was provisioned; in some hyperscaler environments it's derived from the VM UUID.

Admin queue vs I/O queues, across the network

NVMe-oF keeps NVMe's two-queue model. Every controller has one admin queue and zero or more I/O queues, and each queue rides on its own transport connection.

STEP 01 · TCP/RDMA Connect · synchronous transport-level handshake
STEP 02 · Fabrics Connect · admin qpair creates the controller
STEP 03 · Property Get/Set · negotiate queue count, size, etc.
STEP 04 · Create I/O Qpairs · controller grows I/O qpairs as requested
STEP 05 · I/O commands · submit + complete on I/O qpairs
STEP 06 · Keep Alive · admin qpair periodic, prevents timeout

  1. Transport-level connect. For TCP this is a socket accept(), followed by the NVMe/TCP ICReq/ICResp exchange. For RDMA this is an RDMA CM connect request that the target accepts. For VFIO-user it's a connection to the vfio-user socket. The transport creates a "qpair" object but it has no NVMe state yet.

  2. Fabrics Connect. The initiator sends the first NVMe capsule on this qpair: a Connect command with sqsize, recfmt, the subsystem NQN, the host NQN, and the controller ID (or a request for a dynamic one). The target validates the host NQN against the subsystem's allow list, creates an spdk_nvmf_ctrlr, and replies.

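  3. Property Get/Set. Fabrics commands sent on the admin queue that stand in for PCIe register access. The initiator reads the controller's capabilities (CAP), enables the controller through CC, and waits for CSTS to report ready. The remaining limits are settled nearby: the I/O queue count with Set Features (Number of Queues), each queue's size with the sqsize field of its Connect command, and the keep-alive timeout with the kato field of the admin Connect.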

  4. Create I/O Qpairs. Each new I/O qpair is a separate connection (a separate TCP socket, a separate RDMA QP). The initiator sends Connect on each one with the same subsystem NQN and the same controller ID; the target looks up the existing controller and attaches the new qpair to it.

  5. I/O commands. Read, write, flush, etc. Each goes to a specific NSID. The target uses the NSID to look up the spdk_nvmf_ns, translates the NVMe command into a spdk_bdev_io against that namespace's bdev, and dispatches it to the bdev stack.

  6. Keep Alive. The initiator sends periodic Keep Alive admin commands so the target knows the connection is still alive. If the timer expires, the target assumes the connection is dead and tears it down.
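Steps 2 and 4 share one entry point: the Connect command either creates a controller (QID 0) or attaches a qpair to an existing one (QID greater than 0). The sketch below models that branch. ToyTarget is a hypothetical class, the sequential ID allocation is a simplification, and 0xFFFF as the "assign me an ID" value reflects the dynamic controller model as the author of this sketch understands it, so treat it as an assumption.

```python
from itertools import count

DYNAMIC_CNTLID = 0xFFFF  # assumed "let the target choose" value


class ToyTarget:
    """Toy Connect handler for a single subsystem (not SPDK code)."""

    def __init__(self, subnqn, allowed_hosts):
        self.subnqn = subnqn
        self.allowed_hosts = set(allowed_hosts)
        self.ctrlrs = {}                 # cntlid -> {"hostnqn", "qids"}
        self._next_cntlid = count(1)

    def connect(self, qid, subnqn, hostnqn, cntlid=DYNAMIC_CNTLID):
        if subnqn != self.subnqn:
            raise LookupError("unknown subsystem")
        if hostnqn not in self.allowed_hosts:
            raise PermissionError("host not allowed")
        if qid == 0:
            # Admin queue: build a fresh controller and hand back its ID.
            cntlid = next(self._next_cntlid)
            self.ctrlrs[cntlid] = {"hostnqn": hostnqn, "qids": {0}}
            return cntlid
        # I/O queue: must name an existing controller owned by the same host.
        ctrlr = self.ctrlrs.get(cntlid)
        if ctrlr is None or ctrlr["hostnqn"] != hostnqn:
            raise LookupError("no such controller for this host")
        if qid in ctrlr["qids"]:
            raise ValueError(f"QID {qid} already in use")
        ctrlr["qids"].add(qid)
        return cntlid


host = "nqn.2014-08.com.example:host-a"          # example value
tgt = ToyTarget("nqn-A", [host])
cntlid = tgt.connect(qid=0, subnqn="nqn-A", hostnqn=host)
tgt.connect(qid=1, subnqn="nqn-A", hostnqn=host, cntlid=cntlid)
tgt.connect(qid=2, subnqn="nqn-A", hostnqn=host, cntlid=cntlid)
print(cntlid, sorted(tgt.ctrlrs[cntlid]["qids"]))   # 1 [0, 1, 2]
```

Checking the host NQN on the I/O-queue branch is what stops one host from attaching a qpair to another host's controller by guessing its ID.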

Edge cases & what trips people up

NVMe-oF is a small protocol, but it has a long list of small places where the implementation can surprise you. These are the ones we actually hit in production.

The target disappears mid-I/O

A host writes 4KB to a bdev. The bdev layer dispatches the I/O. The transport detects that the remote side closed the connection (TCP RST, RDMA CM event, vfio-user disconnect). What happens to the in-flight I/O?

The transport marks the qpair disconnected. The spdk_nvmf_qpair_disconnect function at include/spdk/nvmf.h:508 schedules a callback that aborts all in-flight I/O on the qpair with NVME_SC_ABORTED_BY_HOST or NVME_SC_HOST_PATH_ERROR depending on the cause. The application sees the I/O fail. A request that had not yet been submitted to the bdev never reaches it. A request the bdev layer had already dispatched cannot be recalled and may still execute; only its completion has nowhere to go.

The namespace is removed while the host is connected

The target admin removes a namespace from a subsystem. What happens to the host's /dev/nvmeXnY?

NVMe-oF has a feature for this: Asynchronous Event Notifications (AENs). The target sends a namespace-changed AEN to every controller that has the removed namespace in its visible list, and the kernel rescans: it re-issues Identify Namespace, which now reports the NSID as not in use. The kernel then typically removes the block device for that NSID, and I/O still in flight to it fails. The host's I/O stack handles the failure like any other I/O error.

What you cannot do: silently keep the bdev open and let writes land in a now-orphaned namespace. The framework's removal path will close the bdev desc. The spdk_nvmf_ns struct, the ns->desc field, and the bdev channel are all torn down by the time the removal completes.

Multiple paths to the same subsystem

You have two paths: RDMA on 10.0.0.1:4420 and TCP on 10.0.0.1:4430. Both lead to the same NQN. The host connects to both. Does the target create one controller or two?

The target creates two controllers. Every Connect on an admin queue creates a new controller, and a controller is bound to the transport it was created on, so the RDMA path and the TCP path each get their own controller ID, admin queue, and I/O qpairs. What ties them together lives on the host: both controllers report the same subsystem NQN and the same namespace identifiers, so the kernel's native multipath groups them and presents one block device with two paths under /sys/class/nvme.

The catch: each controller has its own qpairs, each with its own set of in-flight I/O, and its own qpair_mask. When all of one controller's qpairs are gone, only that controller is destroyed; the other path keeps serving I/O.

Persistent reservations

SCSI had persistent reservations (PRs); NVMe has them too, in the form of Reservation Register, Reservation Acquire, and Reservation Release commands. The semantics are similar: a host can take an exclusive reservation on a namespace, and other hosts get I/O errors until it's released.

For SPDK as a target, the relevant point is that spdk_nvmf_subsystem_add_ns_ext initializes the namespace's reservation_info to all-zero. The reservation state lives in the namespace struct, not the controller. A controller can ask for a reservation; the reservation check is per-namespace. Two controllers on the same namespace can collide on a reservation; only one holds it at a time.
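The per-namespace nature of the reservation is easy to see in a toy model. The sketch below is deliberately simplified: ToyNamespaceReservation is a hypothetical class, it models only a single exclusive-style reservation, and it skips reservation keys, types, and preemption.

```python
class ToyNamespaceReservation:
    """Toy reservation state that lives on the namespace, not the controller."""

    def __init__(self):
        self.registrants = set()   # hosts that have registered
        self.holder = None         # host currently holding the reservation

    def register(self, host_id):
        self.registrants.add(host_id)

    def acquire(self, host_id):
        if host_id not in self.registrants:
            raise PermissionError("host must register before acquiring")
        if self.holder is not None and self.holder != host_id:
            raise RuntimeError("reservation conflict")
        self.holder = host_id

    def release(self, host_id):
        if self.holder == host_id:
            self.holder = None

    def may_write(self, host_id):
        # With no holder, anyone may write; otherwise only the holder may.
        return self.holder is None or self.holder == host_id


resv = ToyNamespaceReservation()
resv.register("host-a")
resv.register("host-b")
resv.acquire("host-a")
print(resv.may_write("host-a"), resv.may_write("host-b"))   # True False
try:
    resv.acquire("host-b")
except RuntimeError as err:
    print("host-b:", err)
resv.release("host-a")
resv.acquire("host-b")
print(resv.holder)   # host-b
```

Because the state hangs off the namespace, every controller attached to that namespace consults the same holder field, which is why two controllers can collide.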

What to take away

NVMe-oF gives the NVMe command set a network. The new words are subsystem (the unit of export), namespace (one bdev, one NSID), listener (a network address on the target), and controller (one initiator's runtime state on one subsystem). The discovery subsystem is the always-on way to find the others. Every controller is owned by one SPDK thread; every namespace is a single open bdev desc; every listener is a transport address paired with a subsystem NQN.

The next page — nvmf_tgt — is about the binary that ties all four of these together, the state machine it walks at startup, and what happens when you push it hard.