Layer 6 · nvmf

The binary, the state machine, and the thread model.

nvmf_tgt is the SPDK application that does one thing: serve bdevs over the network using the NVMe-oF protocol. Inside that one thing is a careful state machine, a per-core poll group mesh, a transport dispatcher, and a long list of small places where things can go wrong under load. This page walks the binary from the moment main() calls spdk_app_start to the moment the last qpair is torn down at shutdown.

~15 min read · 2 diagrams · prerequisites: 6.1 · 2.1 · 2.3
On this page
  1. What nvmf_tgt is, and what it isn't
  2. Configuration: nvmf.json vs JSON-RPC
  3. The startup state machine
  4. How nvmf_tgt learns about existing bdevs
  5. Poll groups: one reactor per core, plus transport threads
  6. Connection accept and dispatch
  7. The shutdown state machine
  8. Edge cases: hot reload, too many connections, hot quitters

What nvmf_tgt is, and what it isn't

nvmf_tgt is a thin binary. It exists to do three things:

  1. Register an spdk_subsystem named "nvmf" so the framework knows to call it during init.
  2. Drive the state machine at startup (create target, create poll groups, start subsystems) and shutdown (stop subsystems, stop listeners, destroy subsystems, destroy poll groups, destroy target).
  3. Wire up the JSON-RPC handlers that manage subsystems, listeners, and namespaces.

That's it. The actual work — accepting connections, processing NVMe commands, dispatching to bdevs — is in the lib/nvmf/ library. nvmf_tgt is the application layer; the library is the implementation.

In modern SPDK, nvmf_tgt is built as a subsystem via SPDK_SUBSYSTEM_REGISTER. It lives at module/event/subsystems/nvmf/nvmf_tgt.c:937, not at the old app/nvmf_tgt/ path. The framework discovers it through the same registration machinery that every other SPDK subsystem uses.

The dependency declarations matter. nvmf cannot initialize until bdev has finished (you can't create namespaces on bdevs that don't exist yet), and sock must be up for the TCP transport. keyring is needed for TLS / DH-HMAC-CHAP. The framework respects these via spdk_subsystem_init_next / spdk_subsystem_fini_next.
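The ordering rule can be illustrated with a small standalone C model. This is a sketch only, not SPDK's implementation: `toy_subsystem`, `toy_init_order`, and the dependency table are hypothetical names, and the only dependencies assumed are the ones named above (nvmf on bdev, sock, and keyring).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_DEPS 4

/* Hypothetical stand-in for a registered subsystem and its declared dependencies. */
struct toy_subsystem {
	const char *name;
	const char *deps[TOY_MAX_DEPS];	/* NULL-terminated list of names */
	int initialized;
};

static struct toy_subsystem *
toy_find(struct toy_subsystem *subs, size_t n, const char *name)
{
	for (size_t i = 0; i < n; i++) {
		if (strcmp(subs[i].name, name) == 0) {
			return &subs[i];
		}
	}
	return NULL;
}

/* Return 1 if every dependency of `s` is already initialized. */
static int
toy_deps_ready(struct toy_subsystem *subs, size_t n, const struct toy_subsystem *s)
{
	for (size_t d = 0; d < TOY_MAX_DEPS && s->deps[d] != NULL; d++) {
		struct toy_subsystem *dep = toy_find(subs, n, s->deps[d]);

		if (dep == NULL || !dep->initialized) {
			return 0;
		}
	}
	return 1;
}

/*
 * Repeatedly initialize any subsystem whose dependencies are satisfied and
 * record the order. Returns how many subsystems were initialized; a value
 * smaller than n means a dependency was missing or cyclic.
 */
static size_t
toy_init_order(struct toy_subsystem *subs, size_t n, const char **order)
{
	size_t done = 0;
	int progress = 1;

	assert(subs != NULL && order != NULL);
	while (progress) {
		progress = 0;
		for (size_t i = 0; i < n; i++) {
			if (!subs[i].initialized && toy_deps_ready(subs, n, &subs[i])) {
				subs[i].initialized = 1;
				order[done++] = subs[i].name;
				progress = 1;
			}
		}
	}
	return done;
}
```

Even with nvmf listed first in the table, it is initialized last, because its three dependencies have to come up before it is eligible.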

Configuration: nvmf.json vs JSON-RPC

There are two ways to configure nvmf_tgt. Both end up calling the same C functions, but they are driven from different places at different times.

| Mode | When it runs | Where the work happens | Who calls it |
| --- | --- | --- | --- |
| Static config (-c nvmf.json) | At spdk_app_start time, before app_run begins polling | Same spdk_nvmf_* C functions as the RPC path | The framework, parsing the JSON |
| JSON-RPC (nvmf_create_subsystem etc.) | Any time after the framework is up | RPC handler at lib/nvmf/nvmf_rpc.c:371 | An external client (diskengine, spdkcli, custom control plane) |

For diskengine's purposes, the static config path is rarely used; everything is dynamic. The flow you care about is nvmf_create_subsystem → nvmf_subsystem_add_listener → nvmf_subsystem_add_ns, all called by Go over the Unix socket. The matching C handlers in lib/nvmf/nvmf_rpc.c, at lines 371, 866, and 1552, are what actually mutate the SPDK state.
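To make the three-call flow concrete, here is a minimal sketch that formats the request bodies in order. The method names and the nqn, listen_address, and namespace/bdev_name parameters follow SPDK's JSON-RPC interface; the helper names, the buffer sizes, and the fixed TCP/IPv4 listener are assumptions made for illustration. Nothing is sent over the socket, and no JSON escaping is done.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TOY_REQ_LEN 512

/*
 * Format one JSON-RPC 2.0 request into `buf`. `params` must already be a
 * JSON object. No string escaping is done, so inputs must not contain
 * quotes or backslashes. Returns 0 on success, -1 on truncation.
 */
static int
toy_rpc_request(char *buf, size_t len, int id, const char *method, const char *params)
{
	int n;

	assert(buf != NULL && len > 0);
	n = snprintf(buf, len,
		     "{\"jsonrpc\":\"2.0\",\"id\":%d,\"method\":\"%s\",\"params\":%s}",
		     id, method, params);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Truncation check shared by the params objects built below. */
static int
toy_fits(int n, size_t len)
{
	return n >= 0 && (size_t)n < len;
}

/* Build the three requests in order: subsystem, listener, namespace. */
static int
toy_build_export_sequence(char reqs[3][TOY_REQ_LEN], const char *nqn,
			  const char *traddr, const char *trsvcid, const char *bdev_name)
{
	char params[384];
	int n;

	n = snprintf(params, sizeof(params),
		     "{\"nqn\":\"%s\",\"allow_any_host\":true}", nqn);
	if (!toy_fits(n, sizeof(params)) ||
	    toy_rpc_request(reqs[0], TOY_REQ_LEN, 1, "nvmf_create_subsystem", params) != 0) {
		return -1;
	}

	n = snprintf(params, sizeof(params),
		     "{\"nqn\":\"%s\",\"listen_address\":{\"trtype\":\"TCP\",\"adrfam\":\"IPv4\","
		     "\"traddr\":\"%s\",\"trsvcid\":\"%s\"}}", nqn, traddr, trsvcid);
	if (!toy_fits(n, sizeof(params)) ||
	    toy_rpc_request(reqs[1], TOY_REQ_LEN, 2, "nvmf_subsystem_add_listener", params) != 0) {
		return -1;
	}

	n = snprintf(params, sizeof(params),
		     "{\"nqn\":\"%s\",\"namespace\":{\"bdev_name\":\"%s\"}}", nqn, bdev_name);
	if (!toy_fits(n, sizeof(params)) ||
	    toy_rpc_request(reqs[2], TOY_REQ_LEN, 3, "nvmf_subsystem_add_ns", params) != 0) {
		return -1;
	}
	return 0;
}
```

A caller would write each of the three strings to the RPC Unix socket in sequence and wait for each reply before sending the next, since the listener and namespace calls both refer to the subsystem created by the first.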

The startup state machine

nvmf_tgt walks a fixed sequence of states at startup. The state machine is in nvmf_tgt_advance_state at module/event/subsystems/nvmf/nvmf_tgt.c:721. It is a giant switch driven by a static g_tgt_state variable.

flowchart TB
NONE[INIT_NONE]
CREATE[INIT_CREATE_TARGET]
POLL[INIT_CREATE_POLL_GROUPS]
START[INIT_START_SUBSYSTEMS]
RUNNING[RUNNING]

NONE --> CREATE
CREATE --> POLL
POLL --> START
START --> RUNNING
RUNNING --> FINI_STOP[FINI_STOP_SUBSYSTEMS]
FINI_STOP --> FINI_LISTEN[FINI_STOP_LISTEN]
FINI_LISTEN --> FINI_SUBS[FINI_DESTROY_SUBSYSTEMS]
FINI_SUBS --> FINI_POLL[FINI_DESTROY_POLL_GROUPS]
FINI_POLL --> FINI_TGT[FINI_DESTROY_TARGET]
FINI_TGT --> STOPPED[STOPPED]
fig. 1: the nvmf_tgt startup state machine

fig. 1   Each state does one chunk of work, then either transitions to the next state or kicks off async work whose completion transitions to the next state. The "do while (g_tgt_state != prev_state)" loop at module/event/subsystems/nvmf/nvmf_tgt.c:845 runs synchronously through whatever states can advance immediately.

The transitions and their work:

STEP 01 · INIT_NONE → CREATE_TARGET: spdk_nvmf_tgt_create + discovery svc
STEP 02 · CREATE_TARGET → CREATE_POLL_GROUPS: spawn nvmf_tgt_poll_group_NNN threads
STEP 03 · CREATE_POLL_GROUPS → START_SUBSYSTEMS: spdk_nvmf_subsystem_start on each subsystem
STEP 04 · START_SUBSYSTEMS → RUNNING: spdk_subsystem_init_next(0)

Three things to notice:

  1. State advances before doing work. The first iteration moves from INIT_NONE to INIT_CREATE_TARGET without doing anything. The second iteration does the work. This lets the do-while loop chain synchronous transitions in a single call: each pass through the switch handles one state, and the loop keeps going for as long as the state keeps changing.

  2. Poll group creation spawns threads, then returns. The nvmf_tgt_create_poll_groups function at module/event/subsystems/nvmf/nvmf_tgt.c:208 sends a message to each new thread asking it to create a poll group, then returns. The state machine does not advance at that point: the loop exits with the state still at INIT_CREATE_POLL_GROUPS, and the transition to INIT_START_SUBSYSTEMS happens only after every poll group creation callback has fired.

  3. Subsystem start walks a linked list. Each subsystem's start callback grabs the next one and calls spdk_nvmf_subsystem_start on it. When the list is empty, we transition to RUNNING.
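The loop shape is easier to see in a stripped-down model. The sketch below is hypothetical code, not the contents of nvmf_tgt.c: the `toy_` names are invented, the per-state work is replaced by comments, and it assumes the poll group step leaves the state unchanged until a completion callback resumes the machine. Subsystem start is collapsed into a synchronous step for brevity.

```c
#include <assert.h>

enum toy_state {
	TOY_INIT_NONE,
	TOY_INIT_CREATE_TARGET,
	TOY_INIT_CREATE_POLL_GROUPS,
	TOY_INIT_START_SUBSYSTEMS,
	TOY_RUNNING,
};

static enum toy_state g_toy_state = TOY_INIT_NONE;
static int g_toy_switch_passes;	/* how many times the switch body ran */

static void
toy_advance_state(void)
{
	enum toy_state prev_state;

	do {
		prev_state = g_toy_state;
		g_toy_switch_passes++;

		switch (g_toy_state) {
		case TOY_INIT_NONE:
			/* No work here: move on and let the next pass do it. */
			g_toy_state = TOY_INIT_CREATE_TARGET;
			break;
		case TOY_INIT_CREATE_TARGET:
			/* Synchronous work would go here, then advance. */
			g_toy_state = TOY_INIT_CREATE_POLL_GROUPS;
			break;
		case TOY_INIT_CREATE_POLL_GROUPS:
			/* Async work is kicked off and the state is left unchanged,
			 * so the loop exits and waits for toy_poll_groups_done(). */
			break;
		case TOY_INIT_START_SUBSYSTEMS:
			/* Simplified: treated as synchronous in this model. */
			g_toy_state = TOY_RUNNING;
			break;
		case TOY_RUNNING:
			break;
		}
	} while (g_toy_state != prev_state);
}

/* Completion callback for the async step: set the next state and resume. */
static void
toy_poll_groups_done(void)
{
	assert(g_toy_state == TOY_INIT_CREATE_POLL_GROUPS);
	g_toy_state = TOY_INIT_START_SUBSYSTEMS;
	toy_advance_state();
}
```

Calling `toy_advance_state()` once runs three passes and parks at the poll group state; calling `toy_poll_groups_done()` afterwards resumes the loop and runs it through to `TOY_RUNNING`.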

How nvmf_tgt learns about existing bdevs

Here's a thing that surprises everyone the first time. nvmf_tgt does not scan the system, load config, or build bdevs. It assumes the bdev framework has already done that.

The dependency is declared at the bottom of the file: SPDK_SUBSYSTEM_DEPEND(nvmf, bdev). This tells the framework "do not start me until bdev is up." By the time nvmf_subsystem_init runs at module/event/subsystems/nvmf/nvmf_tgt.c:848, every bdev module has finished its module_init, every bdev has been registered, and the bdev tree is queryable via spdk_bdev_first / spdk_bdev_next.

When the framework hands nvmf_tgt a JSON-RPC request like nvmf_subsystem_add_ns with bdev_name = "lvstore-uuid/lvol-123", the subsystem code calls spdk_bdev_open_ext_v2 with that name. The bdev framework looks up the bdev by name in its global registry. No coordination with nvmf_tgt; the bdev was registered by bdev_lvol (or bdev_nvme, or whoever) back during the bdev subsystem's init_complete phase.

Poll groups: one reactor per core, plus transport threads

nvmf_tgt is fundamentally a polling application. There is no interrupt handler that wakes it up; the framework's reactor thread per core runs a tight loop calling the transport's poll_group_poll function. The piece of state that ties a transport, a set of connections, and an SPDK thread together is the poll group.

The thread count is the SPDK core count (or the number of cores set in g_poll_groups_mask if one is configured). One thread per core, named nvmf_tgt_poll_group_000, nvmf_tgt_poll_group_001, etc. Each thread is a regular spdk_thread, same as any other SPDK thread: it still drains its message queue (that is how the poll group creation request reaches it), and otherwise it spends its time running the transport poller in a tight loop.

The actual poll group is created on the new thread. Look at nvmf_tgt_create_poll_group at module/event/subsystems/nvmf/nvmf_tgt.c:189:

pg->thread = spdk_get_thread();
pg->group = spdk_nvmf_poll_group_create(g_spdk_nvmf_tgt);
spdk_thread_send_msg(g_tgt_init_thread, nvmf_tgt_create_poll_group_done, pg);

The thread is captured first, then the poll group is created on the thread (so the poll group's internal spdk_io_channel belongs to the right thread), then a message is sent back to the init thread with the result. The init thread's nvmf_tgt_create_poll_group_done callback at line 165 increments a counter; when the count matches the expected total, the state machine advances to INIT_START_SUBSYSTEMS.
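The fan-out and count-back pattern can be modeled without SPDK at all. In the sketch below a "thread" is just a FIFO of pending messages that the test drains by hand; `toy_thread`, `toy_send_msg`, `toy_thread_poll`, and the fixed core count of three are all hypothetical stand-ins, and the poll group itself is reduced to a flag.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_QUEUE_LEN 16
#define TOY_NUM_CORES 3

typedef void (*toy_msg_fn)(void *arg);

struct toy_msg {
	toy_msg_fn fn;
	void *arg;
};

/* Hypothetical stand-in for an spdk_thread: a FIFO of pending messages. */
struct toy_thread {
	struct toy_msg queue[TOY_QUEUE_LEN];
	size_t head, tail;
};

struct toy_poll_group {
	struct toy_thread *thread;	/* thread that owns this poll group */
	int created;
};

static struct toy_thread g_init_thread;
static struct toy_thread g_workers[TOY_NUM_CORES];
static struct toy_poll_group g_groups[TOY_NUM_CORES];
static struct toy_thread *g_current;	/* models spdk_get_thread() */
static size_t g_num_created;
static int g_all_created;

static void
toy_send_msg(struct toy_thread *t, toy_msg_fn fn, void *arg)
{
	assert(t->tail - t->head < TOY_QUEUE_LEN);
	t->queue[t->tail % TOY_QUEUE_LEN] = (struct toy_msg){ fn, arg };
	t->tail++;
}

/* Run every message queued on `t`, with `t` as the current thread. */
static void
toy_thread_poll(struct toy_thread *t)
{
	g_current = t;
	while (t->head != t->tail) {
		struct toy_msg m = t->queue[t->head % TOY_QUEUE_LEN];

		t->head++;
		m.fn(m.arg);
	}
	g_current = NULL;
}

/* Runs on the init thread, once per poll group. */
static void
toy_create_poll_group_done(void *arg)
{
	(void)arg;
	if (++g_num_created == TOY_NUM_CORES) {
		g_all_created = 1;	/* the state machine would advance here */
	}
}

/* Runs on the worker thread, so the group is owned by that thread. */
static void
toy_create_poll_group(void *arg)
{
	struct toy_poll_group *pg = arg;

	pg->thread = g_current;
	pg->created = 1;
	toy_send_msg(&g_init_thread, toy_create_poll_group_done, pg);
}

/* Runs on the init thread: fan out one request per worker, then return. */
static void
toy_create_poll_groups(void)
{
	for (size_t i = 0; i < TOY_NUM_CORES; i++) {
		toy_send_msg(&g_workers[i], toy_create_poll_group, &g_groups[i]);
	}
}
```

Because the counter is only ever touched from the init thread's callback, it needs no lock in this model; message passing serializes the updates.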

Connection accept and dispatch

Connections arrive on a transport listener. The transport's accept path is transport-specific (RDMA CM event, TCP accept, vfio-user connection), but the dispatch from there is uniform:

flowchart LR
REMOTE["Remote host (kernel driver)"]
T1["spdk_nvmf_transport.listen()<br/>sock or RDMA CM"]
T2["spdk_nvmf_tgt_listen_ext()"]
T3["spdk_nvmf_subsystem_add_listener_ext()"]
ACCEPT["acceptor poller<br/>(per transport, per target)"]
PICK["pick poll group<br/>(round-robin or get_optimal)"]
QPAIR["spdk_nvmf_qpair created<br/>on chosen poll group's thread"]
FABRIC["wait for Fabrics Connect capsule"]
CTRL["spdk_nvmf_ctrlr created<br/>(if allowed)"]
READY["qpair ready for I/O"]

REMOTE --> T1
T1 --> T2
T2 --> T3
T3 --> ACCEPT
ACCEPT --> PICK
PICK --> QPAIR
QPAIR --> FABRIC
FABRIC --> CTRL
CTRL --> READY
fig. 2: accept and dispatch

fig. 2   The transport accepts a connection. The new qpair is attached to a poll group, which then waits for a Fabrics Connect command on that qpair. If the host NQN passes the allow-list check, a controller is created and the qpair is moved into the controller's qpair list.

The "pick poll group" step is where the acceptor hands the connection off. The transport gets a vote: it implements get_optimal_poll_group to suggest the best candidate (often "the same core the NIC's IRQ is on"). The framework falls back to round-robin if the transport doesn't care. Look at the transport ops struct in lib/nvmf/rdma.c:5383 for the RDMA case.

Once the qpair is attached, it is polled continuously. The transport's poll_group_poll checks for incoming capsules; when one arrives, it dispatches to nvmf_ctrlr_process_admin_cmd or nvmf_ctrlr_process_io_cmd, depending on the qpair's qid. Qid 0 is admin; everything else is I/O.
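The two decisions described here, which poll group gets a new qpair and which command path a capsule takes, can be sketched as follows. This is an illustrative model with invented names (`toy_pick_poll_group`, `toy_dispatch`, `toy_prefer_last`); the hint callback only imitates the role of get_optimal_poll_group and does not reproduce its real signature.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_poll_group {
	int id;
};

/* Optional transport hook: return a preferred group, or NULL for "no preference". */
typedef struct toy_poll_group *(*toy_get_optimal_fn)(struct toy_poll_group *groups, size_t n);

static size_t g_next_rr;	/* round-robin cursor */

static struct toy_poll_group *
toy_pick_poll_group(struct toy_poll_group *groups, size_t n, toy_get_optimal_fn get_optimal)
{
	struct toy_poll_group *pg = NULL;

	assert(n > 0);
	if (get_optimal != NULL) {
		pg = get_optimal(groups, n);
	}
	if (pg == NULL) {
		/* The transport had no opinion: fall back to round-robin. */
		pg = &groups[g_next_rr];
		g_next_rr = (g_next_rr + 1) % n;
	}
	return pg;
}

/* Example hint: a transport that always prefers the last group. */
static struct toy_poll_group *
toy_prefer_last(struct toy_poll_group *groups, size_t n)
{
	return &groups[n - 1];
}

enum toy_cmd_path {
	TOY_ADMIN_PATH,
	TOY_IO_PATH,
};

/* Queue ID 0 is the admin queue; every other qid carries I/O commands. */
static enum toy_cmd_path
toy_dispatch(uint16_t qid)
{
	return qid == 0 ? TOY_ADMIN_PATH : TOY_IO_PATH;
}
```

In this model a hinted placement does not move the round-robin cursor, so connections placed by hint do not skew the distribution of the rest; whether the real code behaves the same way is not something this sketch claims.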

The shutdown state machine

Shutdown is the startup state machine in reverse, plus a few extra states to drain in-flight I/O. The trigger is a SIGINT or SIGTERM, which the app framework converts into a call to nvmf_subsystem_fini at module/event/subsystems/nvmf/nvmf_tgt.c:855.

nvmf_subsystem_fini calls nvmf_shutdown_cb at line 83, which guards against being called during startup:

if (g_tgt_state < NVMF_TGT_RUNNING) {
    spdk_thread_send_msg(spdk_get_thread(), nvmf_shutdown_cb, NULL);
    return;
}

If we're still initializing, defer: the callback re-queues itself on the current thread and is retried on each pass until the state machine actually reaches RUNNING. Once we're running, transition to FINI_STOP_SUBSYSTEMS and start walking the shutdown chain:

STEP 01 · STOP_SUBSYSTEMS: spdk_nvmf_subsystem_stop on each
STEP 02 · STOP_LISTEN: stop_listen on every listener
STEP 03 · DESTROY_SUBSYSTEMS: spdk_nvmf_subsystem_destroy
STEP 04 · DESTROY_POLL_GROUPS: thread-by-thread teardown
STEP 05 · DESTROY_TARGET: spdk_nvmf_tgt_destroy
STEP 06 · STOPPED: spdk_subsystem_fini_next

The reason for the elaborate sequence is that the states are not all synchronous. spdk_nvmf_subsystem_stop has to wait for in-flight I/O to drain; poll_group_destroy has to wait for every qpair to be released; and so on. Each state is followed by a callback that transitions to the next state. The chain is on a single thread (the original init thread, captured at the start of poll group creation), so there's no race between state transitions.
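Both behaviors, the deferral during startup and the callback-driven chain afterwards, fit in a short single-threaded model. Everything here is hypothetical: the message queue stands in for spdk_thread_send_msg on one thread, `toy_poll_once` stands in for one reactor iteration, and each shutdown step is reduced to a completion message that moves to the next state.

```c
#include <assert.h>
#include <stddef.h>

enum toy_state {
	TOY_INIT,
	TOY_RUNNING,
	TOY_FINI_STOP_SUBSYSTEMS,
	TOY_FINI_STOP_LISTEN,
	TOY_FINI_DESTROY_SUBSYSTEMS,
	TOY_FINI_DESTROY_POLL_GROUPS,
	TOY_FINI_DESTROY_TARGET,
	TOY_STOPPED,
};

#define TOY_QUEUE_LEN 8

typedef void (*toy_msg_fn)(void);

static enum toy_state g_state = TOY_INIT;
static toy_msg_fn g_queue[TOY_QUEUE_LEN];
static size_t g_head, g_tail;
static int g_deferrals;

static void
toy_send_msg(toy_msg_fn fn)
{
	assert(g_tail - g_head < TOY_QUEUE_LEN);
	g_queue[g_tail % TOY_QUEUE_LEN] = fn;
	g_tail++;
}

/* One reactor iteration: run only the messages that were queued on entry,
 * so a message that re-queues itself runs again on the next iteration. */
static void
toy_poll_once(void)
{
	size_t end = g_tail;

	while (g_head != end) {
		toy_msg_fn fn = g_queue[g_head % TOY_QUEUE_LEN];

		g_head++;
		fn();
	}
}

/* Completion of one async shutdown step: move on, then queue the next one. */
static void
toy_fini_step_done(void)
{
	assert(g_state >= TOY_FINI_STOP_SUBSYSTEMS && g_state < TOY_STOPPED);
	g_state = (enum toy_state)(g_state + 1);
	if (g_state != TOY_STOPPED) {
		toy_send_msg(toy_fini_step_done);
	}
}

static void
toy_shutdown_cb(void)
{
	if (g_state < TOY_RUNNING) {
		/* Still initializing: try again on the next iteration. */
		g_deferrals++;
		toy_send_msg(toy_shutdown_cb);
		return;
	}
	g_state = TOY_FINI_STOP_SUBSYSTEMS;
	toy_send_msg(toy_fini_step_done);
}
```

Since every transition runs as a message on the same queue, two transitions can never interleave, which is the property the single-thread design buys.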

Edge cases & what trips people up

Hot config reload: JSON-RPC during a running target

You can call nvmf_create_subsystem while a target is already up. The framework does not stop the world; the new subsystem is created, started, and its listeners begin accepting new connections. Existing connections are not touched.

What you cannot do is mutate a subsystem while it is active. Adding a namespace to an active subsystem requires pausing it first (spdk_nvmf_subsystem_pause), which blocks new admin commands, drains I/O, then transitions to PAUSED. The RPC layer for nvmf_subsystem_add_ns does this for you; see lib/nvmf/nvmf_rpc.c:1552. The pause is per-subsystem and does not affect other subsystems on the same target.
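The pause, mutate, resume discipline looks roughly like this in miniature. The sketch is a synchronous toy with invented names; the real pause and resume calls complete through callbacks, and the drain here is faked by zeroing a counter.

```c
#include <assert.h>
#include <stddef.h>

enum toy_subsys_state {
	TOY_SUBSYS_ACTIVE,
	TOY_SUBSYS_PAUSED,
};

struct toy_subsystem {
	enum toy_subsys_state state;
	int inflight_io;
	int num_ns;
};

static void
toy_subsystem_pause(struct toy_subsystem *s)
{
	/* Stand-in for the drain: real code waits for completions asynchronously. */
	s->inflight_io = 0;
	s->state = TOY_SUBSYS_PAUSED;
}

static void
toy_subsystem_resume(struct toy_subsystem *s)
{
	s->state = TOY_SUBSYS_ACTIVE;
}

/* Namespace changes are only legal while the subsystem is paused. */
static int
toy_subsystem_add_ns(struct toy_subsystem *s)
{
	assert(s != NULL);
	if (s->state != TOY_SUBSYS_PAUSED) {
		return -1;
	}
	s->num_ns++;
	return 0;
}

/* What the RPC layer does on the caller's behalf: pause, mutate, resume. */
static int
toy_rpc_add_ns(struct toy_subsystem *s)
{
	int rc;

	toy_subsystem_pause(s);
	rc = toy_subsystem_add_ns(s);
	toy_subsystem_resume(s);
	return rc;
}
```

Calling `toy_subsystem_add_ns` directly on an active subsystem fails, while going through `toy_rpc_add_ns` succeeds and leaves the subsystem active again.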

Too many connections

Every connection is a file descriptor (TCP) or a QP (RDMA) or a vfio-user endpoint, plus per-qpair memory. The max_queue_depth and max_qpairs_per_ctrlr fields in spdk_nvmf_transport_opts cap the queue depth and the number of qpairs per controller. The actual failure modes:

  • RDMA: out of MR (memory region) entries in the HCA, or out of CQ (completion queue) entries. RDMA resource exhaustion is fatal — there's no way to ask the HCA for more. Tune max_srq_depth, data_wr_pool_size, and the HCA's max_mr / max_qp sysfs values.

  • TCP: fd exhaustion. The relevant limits are the process's ulimit -n and the kernel's fs.nr_open. Set LimitNOFILE=infinity in the systemd unit, and raise fs.nr_open if the kernel default is too low.

  • VFIO-user: no hard cap; the limit is process memory and number of open inotify watches.
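For the TCP case, the descriptor limit can be checked from inside the process with the POSIX getrlimit call. The helper below is a sketch: `toy_fd_budget_ok` is a hypothetical name, and how many descriptors a deployment needs (expected connections plus headroom for everything else the process opens) is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/resource.h>

/*
 * Report whether the soft RLIMIT_NOFILE covers `needed_fds` descriptors.
 * Returns 1 if it does, 0 if it does not, -1 if getrlimit() fails.
 * If `out` is non-NULL it receives the limits that were read.
 */
static int
toy_fd_budget_ok(unsigned long needed_fds, struct rlimit *out)
{
	struct rlimit rl;

	assert(needed_fds > 0);
	if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
		return -1;
	}
	if (out != NULL) {
		*out = rl;
	}
	if (rl.rlim_cur == RLIM_INFINITY) {
		return 1;
	}
	return rl.rlim_cur >= needed_fds ? 1 : 0;
}
```

A control plane could call this at startup with its planned connection count and refuse to start, or log a warning, when the result is 0.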

A single connection going wild

One bad initiator submitting thousands of read I/Os per second can starve other connections on the same poll group. The nvmf_tgt thread model is per-core; if all your connections land on the same core (which is the default if you have one CPU socket and no NUMA awareness), the entire target's throughput can collapse.

Mitigations: pin poll groups to specific cores via poll_groups_mask; use the per-connection max_queue_depth to cap a single qpair's outstanding I/O. The QoS rate limiter on the bdev layer (covered in 4.2) can also be used.
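The effect of a queue depth cap on one qpair can be shown with a tiny model. The struct and function names are invented for illustration, and the sketch only captures the counting rule: a submission is refused once the number of outstanding commands reaches the cap, and a completion frees a slot.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-qpair accounting for outstanding commands. */
struct toy_qpair {
	uint16_t max_queue_depth;
	uint16_t outstanding;
};

/* Returns 0 if the command was accepted, -1 if the qpair is at its cap. */
static int
toy_qpair_submit(struct toy_qpair *q)
{
	if (q->outstanding >= q->max_queue_depth) {
		return -1;
	}
	q->outstanding++;
	return 0;
}

/* A completion frees one slot on the qpair. */
static void
toy_qpair_complete(struct toy_qpair *q)
{
	assert(q->outstanding > 0);
	q->outstanding--;
}
```

With a cap of two, the third back-to-back submission is refused until one of the first two completes, which bounds how much poll group time one connection can claim per pass.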

Shutdown ordering

The state machine is "polite": it stops new I/O, drains in-flight, then disconnects. But the application can also be killed hard (SIGKILL). If a host has open I/O when nvmf_tgt is SIGKILLed, none of that runs: the process is gone before any pending spdk_bdev_io_complete can fire, so in-flight I/O is never completed or aborted by the target. Cleanup falls to the kernel, which closes the TCP socket, tears down the RDMA QP, and removes the vfio-user endpoint, and the host's I/O fails with NVME_SC_HOST_PATH_ERROR.

The right way to shut down is SIGTERM, not SIGKILL. The framework installs signal handlers that trigger nvmf_subsystem_fini, which walks the well-defined state machine and drains cleanly.

What to take away

nvmf_tgt is a small binary with a long state machine. Its job is to wire the framework's reactor model, the bdev stack, and the nvmf library's poll group model into a single coherent target. As drawn in fig. 1, the state machine has eleven states: five on the startup side (INIT_NONE through RUNNING) and six on the shutdown side (FINI_STOP_SUBSYSTEMS through STOPPED). Every transition is either synchronous (one call returns, advance the state) or asynchronous (one call's callback advances the state). Poll groups are per-core, named, and live from startup until the shutdown sequence destroys them.

The next page — Transports — is about the three ways the protocol actually moves bytes: RDMA, TCP, and VFIO-user. Each has its own transport struct, its own verbs or sockets, and its own failure modes.