The four objects, and the protocol that connects them.
NVMe-over-Fabrics takes the NVMe command set you already know from Layer 0.2 and tunnels it over a network. The protocol adds four new nouns that don't exist in PCIe-attached NVMe: subsystem, namespace, listener, and controller. This page is the vocabulary you'll need for the rest of Layer 6.
- What NVMe-oF actually does
- Initiator and target, end to end
- Subsystem — the "SCSI target" of NVMe-oF
- Namespace — one bdev, one NSID
- Listener — the network endpoint on the target
- Controller — the initiator-side state machine
- Discovery service — how initiators find targets
- Host NQN — the iSCSI IQN equivalent
- Admin queue vs I/O queues, across the network
- Edge cases & what trips people up
What NVMe-oF actually does
PCIe-attached NVMe puts the NVMe controller and the host's NVMe driver on the same machine, talking over a PCIe bus. Commands travel in a submission queue, completions come back in a completion queue, and that's it. NVMe-over-Fabrics (NVMe-oF for short) replaces that PCIe bus with a network. The host ("initiator") and the storage device ("target") are now two different processes, often on two different hosts, talking TCP or RDMA. The NVMe command set and the submission/completion queue abstraction are unchanged. The memory-mapped register interface is the exception: over a fabric, the initiator reads and writes controller registers with Property Get and Property Set commands. Everything else that changes is about how the bytes get from one side to the other.
The spec that defines this is the NVMe-oF specification, TP 8013, maintained by the NVM Express, Inc. technical work group. It is layered on top of the base NVMe spec. Every transport (RDMA, TCP, FC, VFIO-user) speaks the same wire-level command capsules and the same connection setup; only the framing differs. SPDK implements the target side in lib/nvmf/.
Initiator and target, end to end
The protocol has exactly two roles, named the same way iSCSI names them:
| Role | What it does | SPDK component | Linux counterpart |
|---|---|---|---|
| Target | Owns the bdevs. Serves them to the network. Accepts connections. | nvmf_tgt binary, the SPDK lib/nvmf/ library | None — there's no in-kernel NVMe-oF target worth using. SPDK is the target. |
| Initiator | Connects to targets. Discovers them. Submits NVMe commands on behalf of an application. | (no SPDK component — see Layer 0) | Kernel nvme_tcp or nvme_rdma driver; the nvme-cli userland tool |
The asymmetry is the thing to internalize: SPDK only ever implements the target side. An SPDK process does not connect to an NVMe-oF target. It does, however, speak PCIe NVMe as a client of the in-kernel driver, which is a different protocol that happens to share the command set. The targets we build with diskengine and the SPDK nvmf_tgt are consumed by either the kernel nvme_tcp/nvme_rdma driver or by another SPDK process running in "PCIe NVMe client" mode.
flowchart LR
  subgraph "Initiator host"
    APP["Application<br/>issues read/write syscalls"]
    DRV["Kernel NVMe-oF driver<br/>nvme_tcp or nvme_rdma"]
    APP --> DRV
  end
  subgraph "Target host (SPDK)"
    NVMF["nvmf_tgt<br/>the SPDK target"]
    BDEV["bdev stack<br/>malloc / nvme / lvol / passthru"]
    NVMF --> BDEV
  end
  DRV -- "RDMA verbs or TCP/TLS" --> NVMF
fig. 1 The protocol is two-sided. The kernel driver on the left submits NVMe capsules over RDMA or TCP. The nvmf_tgt process on the right translates them into bdev I/O. The application sees a normal /dev/nvmeXnY device and never knows the storage is on another machine.
Subsystem — the "SCSI target" of NVMe-oF
A subsystem is the unit of export. It is a named collection of namespaces, with a list of allowed listeners and a list of allowed hosts. If you're coming from iSCSI, think of it as the iSCSI target: the thing you log in to.
In SPDK, the subsystem is a C struct with state, an NQN, a list of
namespaces, a list of listeners, a list of hosts, and a list of
currently connected controllers. Creation goes through
spdk_nvmf_subsystem_create — the function that lives
behind the JSON-RPC method nvmf_create_subsystem.
- subnqn: the subsystem's NVMe Qualified Name. The string the initiator uses to identify which subsystem it wants to connect to. Same format rules as an iSCSI IQN. The well-known discovery NQN is nqn.2014-08.org.nvmexpress.discovery.
- state: one of INACTIVE, ACTIVE, or PAUSED. You can only mutate a subsystem in INACTIVE or PAUSED. This is what makes namespace changes safe: you pause the subsystem, change it, resume it. See the spdk_nvmf_subsystem_add_ns_ext precondition check at lib/nvmf/subsystem.c:2243.
- ctrlrs: every controller currently associated with this subsystem. Each connecting initiator gets one controller. When the last qpair for a controller goes away, the controller is destroyed. See lib/nvmf/nvmf.c:1442 for the "last qpair tears down the ctrlr" path.
- ns: an array of namespace pointers, indexed by NSID. One per NSID, up to max_nsid. Entries are created by spdk_nvmf_subsystem_add_ns_ext at lib/nvmf/subsystem.c:2229.
- next_cntlid: the controller ID to hand out next. Each connecting initiator gets a unique 16-bit ID, allocated from the min_cntlid..max_cntlid range.
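The "pause, change, resume" rule for the state field can be sketched as a toy model. This is not SPDK code: ToySubsystem, its methods, and the example NQN are hypothetical names invented for illustration, and only the three states named above are modelled.

```python
from enum import Enum


class State(Enum):
    INACTIVE = "inactive"
    ACTIVE = "active"
    PAUSED = "paused"


class ToySubsystem:
    """Hypothetical stand-in for the subsystem struct; not SPDK code."""

    def __init__(self, subnqn):
        self.subnqn = subnqn
        self.state = State.INACTIVE
        self.ns = {}  # NSID -> bdev name

    def add_ns(self, nsid, bdev_name):
        # Mirror the precondition described above: no mutation while ACTIVE.
        if self.state not in (State.INACTIVE, State.PAUSED):
            raise RuntimeError("subsystem must be INACTIVE or PAUSED")
        self.ns[nsid] = bdev_name

    def start(self):
        self.state = State.ACTIVE

    def pause(self):
        self.state = State.PAUSED

    def resume(self):
        self.state = State.ACTIVE


subsys = ToySubsystem("nqn.2016-06.io.example:cnode1")  # made-up NQN
subsys.add_ns(1, "Malloc0")      # allowed: still INACTIVE
subsys.start()
try:
    subsys.add_ns(2, "Malloc1")  # rejected: subsystem is ACTIVE
    rejected = False
except RuntimeError:
    rejected = True
subsys.pause()
subsys.add_ns(2, "Malloc1")      # allowed: PAUSED
subsys.resume()
```

The state check lives inside add_ns rather than in the caller, so there is no code path that mutates the namespace table while I/O may be running.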
Namespace — one bdev, one NSID
A namespace is what NVMe-oF calls the exposed
block device. The unit of exposure is one bdev. When you call
spdk_nvmf_subsystem_add_ns_ext, the subsystem opens
the named bdev read-write with spdk_bdev_open_ext_v2
and stores the resulting spdk_bdev_desc inside an
spdk_nvmf_ns. The NSID is allocated from a free slot
in the subsystem's namespace array.
The mapping is one-to-one. Two namespaces cannot point to the same
bdev, and one namespace cannot expose two bdevs. If you want RAID
or concatenation, you do it in the bdev layer (an lvol
sits on top of an lvstore; a vbdev_passthru is
one-to-one). diskengine's wiring is at
NvmfSubsystemAddNs:114.
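The two rules above (an NSID comes from a free slot, and a bdev is exported at most once) can be sketched with a small table. ToyNamespaceTable is a hypothetical model, and the convention that nsid=0 means "pick one for me" is an assumption of this sketch.

```python
class ToyNamespaceTable:
    """Hypothetical model of the per-subsystem namespace array."""

    def __init__(self, max_nsid):
        # Index 0 holds NSID 1; None marks a free slot.
        self.slots = [None] * max_nsid

    def add_ns(self, bdev_name, nsid=0):
        if bdev_name in self.slots:
            raise ValueError(f"{bdev_name} is already exported")
        if nsid == 0:
            # No specific NSID requested: take the first free slot.
            try:
                nsid = self.slots.index(None) + 1
            except ValueError:
                raise RuntimeError("no free NSID") from None
        elif not 1 <= nsid <= len(self.slots):
            raise ValueError("NSID out of range")
        elif self.slots[nsid - 1] is not None:
            raise ValueError(f"NSID {nsid} is in use")
        self.slots[nsid - 1] = bdev_name
        return nsid

    def remove_ns(self, nsid):
        self.slots[nsid - 1] = None


table = ToyNamespaceTable(max_nsid=4)
a = table.add_ns("Malloc0")
b = table.add_ns("Malloc1")
table.remove_ns(a)
c = table.add_ns("Malloc2")  # reuses the slot freed by remove_ns
```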
Listener — the network endpoint on the target
A listener is the address a target listens on.
It's a transport ID: trtype (RDMA / TCP / VFIOUSER /
FC), adrfam (IPv4 / IPv6), traddr (the
IP, or for FC, the WWN), and trsvcid (the port
number, or for FC, the route).
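The four fields of a transport ID map naturally onto a small record type. The sketch below is hypothetical: TransportId and parse_trid are invented names, and the space-separated key:value string form is an assumption used for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransportId:
    trtype: str
    adrfam: str
    traddr: str
    trsvcid: str


def parse_trid(text):
    """Parse 'key:value key:value ...' into a TransportId."""
    fields = {}
    for token in text.split():
        # Split on the first colon only, so IPv6 addresses survive.
        key, sep, value = token.partition(":")
        if not sep or not value:
            raise ValueError(f"malformed token: {token!r}")
        fields[key.lower()] = value
    missing = {"trtype", "adrfam", "traddr", "trsvcid"} - fields.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return TransportId(fields["trtype"].upper(), fields["adrfam"],
                       fields["traddr"], fields["trsvcid"])


trid = parse_trid("trtype:tcp adrfam:IPv4 traddr:10.0.0.1 trsvcid:4420")
```

Making the record frozen means it is hashable, so a transport ID can be used as a set member or dictionary key when tracking which addresses are already being listened on.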
Listeners belong to a subsystem, not to a target. Two subsystems can each have a listener on the same target, even on the same transport address — they're tagged by NQN, not by address. The acceptor poller is at the target level; the per-subsystem filter happens when a connection is established.
The listener-add function is short, but the work underneath is the meat. A
listener is added in three phases: the subsystem is paused (so no
new connections arrive), the transport actually starts listening
on the trid, and the subsystem is resumed. The transport's
listen_associate hook is where the work gets
transport-specific — for RDMA it opens a listening CM ID; for TCP
it opens a socket; for VFIO-user it sets up a vfio-user server
endpoint.
The opts argument is transport-specific: secure
channel (TLS) for TCP, ANA state, the socket implementation
selector, etc. The header is at
include/spdk/nvmf.h:908 .
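The three phases of adding a listener (pause, listen, resume) can be sketched as follows. Everything here is a toy: ToyTransport, add_listener, and the dictionary-based subsystem are hypothetical, and resuming even when the listen step fails is a design choice of this sketch, not a statement about SPDK's error handling.

```python
class ToyTransport:
    """Hypothetical transport that only records what it listens on."""

    def __init__(self):
        self.listening = set()

    def listen(self, trid):
        if not trid:
            raise ValueError("empty transport ID")
        self.listening.add(trid)


def add_listener(subsystem, transport, trid):
    subsystem["state"] = "PAUSED"              # phase 1: pause
    try:
        transport.listen(trid)                 # phase 2: start listening
        subsystem["listeners"].append(trid)
    finally:
        subsystem["state"] = "ACTIVE"          # phase 3: resume


subsystem = {"state": "ACTIVE", "listeners": []}
transport = ToyTransport()
add_listener(subsystem, transport, "TCP:10.0.0.1:4420")
```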
Controller — the initiator-side state machine
A controller is the NVMe-oF equivalent of an "I_T nexus" in SCSI. It's the runtime state for one connecting initiator on one subsystem: an admin queue, a set of I/O qpairs, the negotiated limits, the controller ID, the host NQN, and the keep-alive timer.
A controller is created by the Fabrics Connect
command. The flow is: initiator sends a Connect capsule on the
admin qpair, the target's nvmf_ctrlr_create at
lib/nvmf/ctrlr.c:438 builds a fresh
spdk_nvmf_ctrlr, allocates a controller ID, sets up
the admin queue, and replies with a Connect response.
- dynamic_ctrlr: true for fabric transports (RDMA/TCP), false for direct-attach transports (PCIe, VFIO-user). For fabric transports, the controller ID is allocated by the target at connect time. For non-fabric transports, the controller ID comes from the host's cntlid field.
- thread: the SPDK thread the controller lives on. Always the qpair's poll group thread. This is the single constraint that keeps the controller lock-free: every operation on a controller runs on its thread. Cross-thread operations need an spdk_thread_send_msg.
- qpair_mask: a bit per qpair, used to find the last qpair when a controller is tearing down. When the last bit clears, the controller is destroyed. See lib/nvmf/nvmf.c:1442 for the "last qpair, free the ctrlr" path.
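Two mechanisms from this section lend themselves to a short sketch: handing out controller IDs from a bounded range, and using a bitmask to notice when the last qpair goes away. ToyCtrlr and ToyCntlidAllocator are hypothetical; the wrap-around search is one plausible allocation strategy, assumed here for illustration.

```python
class ToyCtrlr:
    def __init__(self, cntlid):
        self.cntlid = cntlid
        self.qpair_mask = 0  # bit N set means queue ID N is attached

    def attach(self, qid):
        self.qpair_mask |= 1 << qid

    def detach(self, qid):
        """Clear the bit; return True when that was the last qpair."""
        self.qpair_mask &= ~(1 << qid)
        return self.qpair_mask == 0


class ToyCntlidAllocator:
    def __init__(self, min_cntlid, max_cntlid):
        self.min, self.max = min_cntlid, max_cntlid
        self.next_cntlid = min_cntlid
        self.in_use = set()

    def alloc(self):
        # Walk the range once, starting at next_cntlid, wrapping at max.
        for _ in range(self.max - self.min + 1):
            candidate = self.next_cntlid
            self.next_cntlid = self.min if candidate == self.max else candidate + 1
            if candidate not in self.in_use:
                self.in_use.add(candidate)
                return candidate
        raise RuntimeError("no free controller ID")

    def free(self, cntlid):
        self.in_use.discard(cntlid)


ids = ToyCntlidAllocator(min_cntlid=1, max_cntlid=3)
first, second = ids.alloc(), ids.alloc()
ctrlr = ToyCtrlr(first)
ctrlr.attach(0)                    # admin queue
ctrlr.attach(1)                    # one I/O queue
still_alive = not ctrlr.detach(1)  # admin queue remains
last_gone = ctrlr.detach(0)        # last bit cleared: destroy the controller
ids.free(first)
third, fourth = ids.alloc(), ids.alloc()
```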
Discovery service — how initiators find targets
An initiator needs to know what subsystems exist before it can
connect. NVMe-oF handles this with a special pre-defined
subsystem called the discovery subsystem. Its NQN is
fixed: nqn.2014-08.org.nvmexpress.discovery.
Connect to the discovery subsystem and send
Get Log Page with LID 0x70 (the Discovery log page identifier). The target
returns a discovery log page: a list of all the
subsystems it knows about, with their NQN, transport type, and
the addresses where they can be reached.
The discovery subsystem is created with allow_any_host = true. Discovery is meant to
be open to all initiators — otherwise how would you find
anything? The discovery subsystem also can't have namespaces
(the num_ns argument is 0, and the create function
at lib/nvmf/subsystem.c:240 enforces
this).
sequenceDiagram
  participant Init as Initiator (kernel driver)
  participant DiscTgt as Target's discovery svc
  participant RegTgt as Target's regular subsystem
  Init->>DiscTgt: nvme connect (subnqn = nqn.2014-08...discovery, traddr = 10.0.0.1, RDMA)
  DiscTgt-->>Init: connect OK
  Init->>DiscTgt: Get Log Page 0x70
  DiscTgt-->>Init: discovery log: [nqn-A on 10.0.0.1:4420 RDMA, nqn-B on 10.0.0.1:4421 RDMA]
  Init->>RegTgt: nvme connect (subnqn = nqn-A, traddr = 10.0.0.1, RDMA)
  RegTgt-->>Init: connect OK, ctrlr created
fig. 2 The initiator first connects to the discovery service (the fixed NQN), pulls a list of available subsystems, then connects to whichever one it actually wants. The discovery controller has no namespaces; it only serves the log page.
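The content of the discovery log in fig. 2 can be modelled as a flat list with one entry per subsystem and listener pair. This is a hypothetical sketch: build_discovery_log and its dictionary layout are invented, and leaving the discovery subsystem itself out of the list is an assumption of the sketch.

```python
DISCOVERY_NQN = "nqn.2014-08.org.nvmexpress.discovery"


def build_discovery_log(subsystems):
    """Return one entry per (subsystem, listener) pair.

    `subsystems` maps subnqn -> list of (trtype, traddr, trsvcid) tuples.
    """
    entries = []
    for subnqn, listeners in sorted(subsystems.items()):
        if subnqn == DISCOVERY_NQN:
            continue  # sketch assumption: only list I/O subsystems
        for trtype, traddr, trsvcid in listeners:
            entries.append({"subnqn": subnqn, "trtype": trtype,
                            "traddr": traddr, "trsvcid": trsvcid})
    return entries


log = build_discovery_log({
    DISCOVERY_NQN: [("RDMA", "10.0.0.1", "4420")],
    "nqn-A": [("RDMA", "10.0.0.1", "4420")],
    "nqn-B": [("RDMA", "10.0.0.1", "4421")],
})
```

An initiator would walk this list, pick the entry whose subnqn it wants, and connect to that entry's address.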
Host NQN — the iSCSI IQN equivalent
Every initiator has an identifier, just like every iSCSI
initiator has an IQN. It's called the Host NQN
and it goes into the connect_data at connect time.
The target uses it to decide whether the initiator is allowed
to talk to the subsystem.
By default, a subsystem is locked to a list of allowed Host NQNs,
populated with spdk_nvmf_subsystem_add_host. Only initiators in
the list can connect, unless the subsystem's own
allow_any_host flag is set to true. The well-known
discovery NQN is normally open to everyone; production I/O
subsystems should be locked down.
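The access decision reduces to a two-step check. The function and the dictionary layout below are hypothetical; the NQNs are made-up examples.

```python
def host_allowed(subsystem, hostnqn):
    """Return True if `hostnqn` may connect to this toy subsystem."""
    if subsystem.get("allow_any_host", False):
        return True
    return hostnqn in subsystem.get("hosts", set())


discovery = {"allow_any_host": True}
production = {"allow_any_host": False,
              "hosts": {"nqn.2014-08.org.example:host-a"}}  # made-up NQN
```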
The Host NQN lives in /etc/nvme/hostnqn on a Linux
initiator. It's persistent across reboots and uniquely identifies
a single physical machine. For a VM, it's whatever was set when
the VM was provisioned; in some hyperscaler environments it's
derived from the VM UUID.
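A minimal sketch of reading the Host NQN file and of deriving an NQN from a UUID, as a provisioning system might. The helper names are hypothetical. The UUID-based prefix and the 223-byte length limit follow the NQN format as I understand it; treat both as assumptions to check against the spec.

```python
import uuid
from pathlib import Path

UUID_NQN_PREFIX = "nqn.2014-08.org.nvmexpress:uuid:"


def hostnqn_from_uuid(machine_uuid):
    """Build a UUID-based NQN from a uuid.UUID."""
    return UUID_NQN_PREFIX + str(machine_uuid)


def read_hostnqn(path="/etc/nvme/hostnqn"):
    """Return the stored Host NQN, or None if the file is absent or invalid."""
    p = Path(path)
    if not p.is_file():
        return None
    text = p.read_text().strip()
    # Sanity checks only: the "nqn." prefix and the maximum length.
    if not text.startswith("nqn.") or len(text.encode()) > 223:
        return None
    return text


example = hostnqn_from_uuid(uuid.UUID(int=0))
```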
Admin queue vs I/O queues, across the network
NVMe-oF keeps NVMe's two-queue model. Every controller has one admin queue and zero or more I/O queues, and each queue rides on its own transport connection. A new controller comes up in this order:
1. Transport-level connect. For TCP this is a socket accept(). For RDMA this is an RDMA CM RECV+SEND exchange. For VFIO-user it's a connection to the vfio-user socket. The transport creates a "qpair" object but it has no NVMe state yet.
2. Fabrics Connect. The initiator sends the first NVMe capsule on this qpair: a Connect command with sqsize, recfmt, the subsystem NQN, the host NQN, and the controller ID (or a request for a dynamic one). The target validates the host NQN against the subsystem's allow list, creates an spdk_nvmf_ctrlr, and replies.
3. Property Get/Set. Admin commands. The initiator asks for the controller's limits and tunes its queue count, queue size, and the keep-alive timer.
4. Create I/O Qpairs. Each new I/O qpair is a separate connection (a separate TCP socket, a separate RDMA QP). The initiator sends Connect on each one with the same subsystem NQN and the same controller ID; the target looks up the existing controller and attaches the new qpair to it.
5. I/O commands. Read, write, flush, etc. Each goes to a specific NSID. The target's spdk_nvmf_ns maps the NSID to its bdev, and the command is translated into an spdk_bdev_io and dispatched to the bdev stack.
6. Keep Alive. The initiator sends periodic Keep Alive admin commands so the target knows the connection is still alive. If the timer expires, the target assumes the connection is dead and tears it down.
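The difference between the admin-queue Connect and the I/O-queue Connects can be sketched as one dispatch function. ToyTarget is hypothetical and collapses the subsystem, the allow list, and the controller table into one object. The value 0xFFFF as the "allocate a controller ID for me" marker is how I understand the dynamic controller model; it is labelled as an assumption here.

```python
DYNAMIC_CNTLID = 0xFFFF  # assumed marker for "target picks the ID"


class ToyTarget:
    def __init__(self, subnqn, allowed_hosts):
        self.subnqn = subnqn
        self.allowed_hosts = set(allowed_hosts)
        self.ctrlrs = {}      # cntlid -> {"hostnqn": str, "qids": set}
        self.next_cntlid = 1

    def connect(self, qid, subnqn, hostnqn, cntlid=DYNAMIC_CNTLID):
        if subnqn != self.subnqn:
            return ("error", "unknown subsystem")
        if hostnqn not in self.allowed_hosts:
            return ("error", "host not allowed")
        if qid == 0:
            # Admin queue: build a fresh controller.
            new_id = self.next_cntlid
            self.next_cntlid += 1
            self.ctrlrs[new_id] = {"hostnqn": hostnqn, "qids": {0}}
            return ("ok", new_id)
        # I/O queue: must name an existing controller owned by this host.
        ctrlr = self.ctrlrs.get(cntlid)
        if ctrlr is None or ctrlr["hostnqn"] != hostnqn:
            return ("error", "no such controller")
        if qid in ctrlr["qids"]:
            return ("error", "queue ID in use")
        ctrlr["qids"].add(qid)
        return ("ok", cntlid)


tgt = ToyTarget("nqn-A", allowed_hosts={"host-1"})
status, cid = tgt.connect(0, "nqn-A", "host-1")      # admin queue
io_result = tgt.connect(1, "nqn-A", "host-1", cid)   # first I/O queue
denied = tgt.connect(0, "nqn-A", "host-2")           # not on the allow list
```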
Edge cases & what trips people up
NVMe-oF is a small protocol, but it has a long list of small places where the implementation can surprise you. These are the ones we actually hit in production.
The target disappears mid-I/O
A host writes 4KB to a bdev. The bdev layer dispatches the I/O. The transport detects that the remote side closed the connection (TCP RST, RDMA CM event, vfio-user disconnect). What happens to the in-flight I/O?
The transport marks the qpair disconnected. The
spdk_nvmf_qpair_disconnect function at
include/spdk/nvmf.h:508 schedules a
callback that aborts all in-flight I/O on the qpair with
NVME_SC_ABORTED_BY_HOST or
NVME_SC_HOST_PATH_ERROR depending on the cause. The
application sees the I/O fail. The target's underlying bdev
never gets the I/O; from its point of view, the request was
aborted before it executed.
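The "abort everything still outstanding" step can be modelled with a table of pending completions. ToyQpair and the status strings are hypothetical placeholders, not SPDK status codes.

```python
class ToyQpair:
    def __init__(self):
        self.outstanding = {}   # command ID -> completion callback
        self.connected = True

    def submit(self, cid, on_complete):
        if not self.connected:
            on_complete(cid, "ABORTED")
            return
        self.outstanding[cid] = on_complete

    def complete(self, cid, status="SUCCESS"):
        callback = self.outstanding.pop(cid, None)
        if callback is not None:
            callback(cid, status)

    def disconnect(self):
        self.connected = False
        # Swap the table out first so callbacks cannot re-enter it.
        pending, self.outstanding = self.outstanding, {}
        for cid, callback in pending.items():
            callback(cid, "ABORTED")


results = {}
record = lambda cid, status: results.__setitem__(cid, status)

qp = ToyQpair()
qp.submit(1, record)
qp.submit(2, record)
qp.complete(1)      # finishes normally
qp.disconnect()     # command 2 is still in flight and gets aborted
qp.submit(3, record)  # arrives after the disconnect
```

Every submitted command gets exactly one completion, whether it succeeded, was in flight at disconnect, or arrived too late.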
The namespace is removed while the host is connected
The target admin removes a namespace from a subsystem. What
happens to the host's /dev/nvmeXnY?
NVMe-oF has a feature for this: AEN, the Asynchronous Event
Notification. The target sends an AEN to every controller that
has the removed namespace in its visible list, and the kernel
re-issues Identify Namespace, which now returns
"NSID not in use." The block device stays in the kernel; new I/O
to the removed NSID fails. The host's I/O stack handles the
failure like any other I/O error.
What you cannot do: silently keep the bdev open and let writes
land in a now-orphaned namespace. The framework's removal path
will close the bdev desc. The
spdk_nvmf_ns struct, the ns->desc
field, and the bdev channel are all torn down by the time the
removal completes.
Multiple paths to the same subsystem
You have two paths: RDMA on 10.0.0.1:4420 and TCP on 10.0.0.1:4430. Both lead to the same NQN. The host connects to both. Does the target create one controller or two?
The target creates two controllers, one per path. Every
Connect on an admin queue builds a fresh controller,
as described in the Controller section above; only I/O-queue
Connects that carry an existing controller ID attach to an
existing controller. The host is what ties the two together: both
controllers report the same subsystem NQN and the same namespace,
so the kernel's native multipath presents one block device
with two paths under /sys/class/nvme.
The catch: each controller has its own qpairs, its own in-flight I/O, and its own keep-alive timer. Each controller's qpair_mask tracks only its own qpairs. When all qpairs of one controller are gone, that controller is destroyed and the other path keeps running.
Persistent reservations
SCSI had persistent reservations (PRs); NVMe has them too, in
the form of Reservation Register,
Reservation Acquire, and
Reservation Release commands. The semantics are
similar: a host can take an exclusive reservation on a
namespace, and other hosts get I/O errors until it's released.
For SPDK as a target, the relevant point is that
spdk_nvmf_subsystem_add_ns_ext initializes the
namespace's reservation_info to all-zero. The
reservation state lives in the namespace struct, not the
controller. A controller can ask for a reservation; the
reservation check is per-namespace. Two controllers on the
same namespace can collide on a reservation; only one holds it
at a time.
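The point that reservation state belongs to the namespace, not the controller, can be shown with a deliberately simplified model. ToyNamespace is hypothetical: real reservations involve registration keys and several reservation types, none of which are modelled here.

```python
class ToyNamespace:
    def __init__(self, nsid):
        self.nsid = nsid
        self.holder = None  # host holding the reservation, if any

    def acquire(self, host):
        if self.holder not in (None, host):
            return False  # someone else already holds it
        self.holder = host
        return True

    def release(self, host):
        if self.holder != host:
            return False
        self.holder = None
        return True

    def write_allowed(self, host):
        # The check consults namespace state only; no controller is involved.
        return self.holder in (None, host)


ns = ToyNamespace(nsid=1)
got_a = ns.acquire("host-a")
got_b = ns.acquire("host-b")          # collides with host-a
b_can_write = ns.write_allowed("host-b")
ns.release("host-a")
b_can_write_after = ns.write_allowed("host-b")
```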
What to take away
NVMe-oF gives the NVMe command set a network. The new words are subsystem (the unit of export), namespace (one bdev, one NSID), listener (a network address on the target), and controller (one initiator's runtime state on one subsystem). The discovery subsystem is the always-on way to find the others. Every controller is owned by one SPDK thread; every namespace is a single open bdev desc; every listener is a transport address paired with a subsystem NQN.
The next page — nvmf_tgt — is about the binary that ties all four of these together, the state machine it walks at startup, and what happens when you push it hard.