The binary, the state machine, and the thread model.
nvmf_tgt is the SPDK application that does one thing: serve
bdevs over the network using the NVMe-oF protocol. Inside that
one thing is a careful state machine, a per-core poll group
mesh, a transport dispatcher, and a long list of small places
where things can go wrong under load. This page walks the
binary from the moment main() calls
spdk_app_start to the moment the last qpair is
torn down at shutdown.
- What nvmf_tgt is, and what it isn't
- Configuration: nvmf.json vs JSON-RPC
- The startup state machine
- How nvmf_tgt learns about existing bdevs
- Poll groups: one reactor per core, plus transport threads
- Connection accept and dispatch
- The shutdown state machine
- Edge cases: hot reload, too many connections, hot quitters
What nvmf_tgt is, and what it isn't
nvmf_tgt is a thin binary. It exists to do three
things:
- Register an spdk_subsystem named "nvmf" so the framework knows to call it during init.
- Drive the state machine at startup (create target, create poll groups, start subsystems) and shutdown (stop subsystems, stop listeners, destroy poll groups, destroy target).
- Wire up the JSON-RPC handlers that manage subsystems, listeners, and namespaces.
That's it. The actual work — accepting connections, processing
NVMe commands, dispatching to bdevs — is in the
lib/nvmf/ library. nvmf_tgt is the application
layer; the library is the implementation.
In modern SPDK, nvmf_tgt is built as a subsystem via
SPDK_SUBSYSTEM_REGISTER. It lives at
— not at the old app/nvmf_tgt/ path. The framework
discovers it through the same registration machinery that all
other SPDK apps use.
The dependency declarations matter. nvmf cannot
initialize until bdev has finished (you can't
create namespaces on bdevs that don't exist yet), and
sock must be up for the TCP transport.
keyring is needed for TLS / DH-HMAC-CHAP. The
framework respects these via
spdk_subsystem_init_next /
spdk_subsystem_fini_next.
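The ordering rule can be illustrated with a small standalone model. This is a sketch, not SPDK code: `struct subsys`, `deps_ready`, and `init_order` are hypothetical names, and the real framework resolves dependencies through its own registration lists. It only shows the idea that a subsystem initializes after everything it depends on.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_DEPS 4

/* Hypothetical model of a subsystem and the names it depends on. */
struct subsys {
    const char *name;
    const char *deps[MAX_DEPS]; /* unused slots are NULL */
    int initialized;
};

struct subsys *find_subsys(struct subsys *all, size_t n, const char *name)
{
    for (size_t i = 0; i < n; i++) {
        if (strcmp(all[i].name, name) == 0) {
            return &all[i];
        }
    }
    return NULL;
}

/* A subsystem is ready once every dependency has already initialized. */
int deps_ready(struct subsys *all, size_t n, const struct subsys *s)
{
    for (size_t d = 0; d < MAX_DEPS && s->deps[d] != NULL; d++) {
        const struct subsys *dep = find_subsys(all, n, s->deps[d]);
        if (dep == NULL || !dep->initialized) {
            return 0;
        }
    }
    return 1;
}

/*
 * Fill order[] with subsystem names in a valid init order.
 * Returns how many were initialized; fewer than n means a missing
 * dependency or a dependency cycle.
 */
size_t init_order(struct subsys *all, size_t n, const char **order)
{
    size_t done = 0;
    int progressed = 1;

    assert(order != NULL);
    while (done < n && progressed) {
        progressed = 0;
        for (size_t i = 0; i < n; i++) {
            if (!all[i].initialized && deps_ready(all, n, &all[i])) {
                all[i].initialized = 1;
                order[done++] = all[i].name;
                progressed = 1;
            }
        }
    }
    return done;
}
```

With nvmf depending on bdev, sock, and keyring, `init_order` places nvmf last; teardown would walk the same order in reverse.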
Configuration: nvmf.json vs JSON-RPC
There are two ways to configure nvmf_tgt. Both end up calling the same C functions, but they are driven by different callers at different times.
| Mode | When it runs | Where the work happens | Who calls it |
|---|---|---|---|
| Static config (-c nvmf.json) | At spdk_app_start time, before app_run begins polling | Same spdk_nvmf_* C functions as the RPC path | The framework, parsing the JSON |
| JSON-RPC (nvmf_create_subsystem etc.) | Any time after the framework is up | RPC handler at lib/nvmf/nvmf_rpc.c:371 | An external client (diskengine, spdkcli, custom control plane) |
For diskengine's purposes, the static config path is rarely used;
everything is dynamic. The flow you care about is
nvmf_create_subsystem → nvmf_subsystem_add_listener
→ nvmf_subsystem_add_ns, all called by Go over the
Unix socket. The matching C handlers in
lib/nvmf/nvmf_rpc.c (lines 371, 866, and 1552)
are what actually mutate the SPDK state.
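As an illustration of what travels over the Unix socket, the sketch below builds the three JSON-RPC 2.0 requests in that order. The helper names are hypothetical, the NQN and address values are placeholders, and the parameter field names (`nqn`, `listen_address`, `namespace`, `bdev_name`) are written as I understand SPDK's JSON-RPC interface, so check them against the documentation for your SPDK version.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Each helper writes one JSON-RPC 2.0 request into buf and returns the
 * snprintf result (the length needed, excluding the terminating NUL).
 * Assumption: string arguments contain no characters that need JSON escaping.
 */
int build_create_subsystem(char *buf, size_t len, int id, const char *nqn)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len,
                    "{\"jsonrpc\":\"2.0\",\"id\":%d,"
                    "\"method\":\"nvmf_create_subsystem\","
                    "\"params\":{\"nqn\":\"%s\"}}",
                    id, nqn);
}

int build_add_listener(char *buf, size_t len, int id, const char *nqn,
                       const char *trtype, const char *adrfam,
                       const char *traddr, const char *trsvcid)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len,
                    "{\"jsonrpc\":\"2.0\",\"id\":%d,"
                    "\"method\":\"nvmf_subsystem_add_listener\","
                    "\"params\":{\"nqn\":\"%s\",\"listen_address\":"
                    "{\"trtype\":\"%s\",\"adrfam\":\"%s\","
                    "\"traddr\":\"%s\",\"trsvcid\":\"%s\"}}}",
                    id, nqn, trtype, adrfam, traddr, trsvcid);
}

int build_add_ns(char *buf, size_t len, int id, const char *nqn,
                 const char *bdev_name)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len,
                    "{\"jsonrpc\":\"2.0\",\"id\":%d,"
                    "\"method\":\"nvmf_subsystem_add_ns\","
                    "\"params\":{\"nqn\":\"%s\","
                    "\"namespace\":{\"bdev_name\":\"%s\"}}}",
                    id, nqn, bdev_name);
}
```

A caller would send the three buffers in sequence and wait for each response before issuing the next, since the listener and namespace calls both refer to the subsystem created by the first.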
The startup state machine
nvmf_tgt walks a fixed sequence of states at startup. The
state machine is in nvmf_tgt_advance_state at
module/event/subsystems/nvmf/nvmf_tgt.c:721.
It is a giant switch driven by a static
g_tgt_state variable.
flowchart TB
  NONE[INIT_NONE]
  CREATE[INIT_CREATE_TARGET]
  POLL[INIT_CREATE_POLL_GROUPS]
  START[INIT_START_SUBSYSTEMS]
  RUNNING[RUNNING]
  NONE --> CREATE
  CREATE --> POLL
  POLL --> START
  START --> RUNNING
  RUNNING --> FINI_STOP[FINI_STOP_SUBSYSTEMS]
  FINI_STOP --> FINI_LISTEN[FINI_STOP_LISTEN]
  FINI_LISTEN --> FINI_SUBS[FINI_DESTROY_SUBSYSTEMS]
  FINI_SUBS --> FINI_POLL[FINI_DESTROY_POLL_GROUPS]
  FINI_POLL --> FINI_TGT[FINI_DESTROY_TARGET]
  FINI_TGT --> STOPPED[STOPPED]
fig. 1 Each state does one chunk of work, then either
transitions to the next state or kicks off async work whose
completion transitions to the next state. The "do while
(g_tgt_state != prev_state)" loop
runs synchronously through whatever states can advance immediately.
The transitions and their work:
Three things to notice:
- State advances before doing work. The first iteration moves from INIT_NONE to INIT_CREATE_TARGET without doing anything. The second iteration does the work. This lets the do-while loop chain synchronous transitions without re-entering the switch.
- Poll group creation spawns threads, then returns. The nvmf_tgt_create_poll_groups function at module/event/subsystems/nvmf/nvmf_tgt.c:208 sends a message to each new thread asking it to create a poll group, then returns. The state machine stays in INIT_CREATE_POLL_GROUPS while those messages are in flight; the START_SUBSYSTEMS work doesn't start until all poll group creation callbacks have fired, and the last callback is what moves the state to INIT_START_SUBSYSTEMS.
- Subsystem start walks a linked list. Each subsystem's start callback grabs the next one and calls spdk_nvmf_subsystem_start on it. When the list is empty, we transition to RUNNING.
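The loop shape described here can be sketched in isolation. The code below is a simplified model, not the contents of nvmf_tgt.c: the enum mirrors the startup half of fig. 1, while `struct tgt`, `tgt_advance_state`, and `tgt_async_done` are hypothetical names. A state that kicks off asynchronous work leaves the state unchanged, which is what ends the do-while; the completion callback later sets the next state and re-enters the loop.

```c
#include <assert.h>

/* Startup states from fig. 1 (shortened names; the enum is an assumption). */
enum tgt_state {
    INIT_NONE,
    INIT_CREATE_TARGET,
    INIT_CREATE_POLL_GROUPS,
    INIT_START_SUBSYSTEMS,
    RUNNING,
};

struct tgt {
    enum tgt_state state;
    int target_created;
};

void tgt_advance_state(struct tgt *t)
{
    enum tgt_state prev;

    do {
        prev = t->state;
        switch (t->state) {
        case INIT_NONE:
            /* Advance only; the work happens on the next iteration. */
            t->state = INIT_CREATE_TARGET;
            break;
        case INIT_CREATE_TARGET:
            t->target_created = 1;          /* synchronous work */
            t->state = INIT_CREATE_POLL_GROUPS;
            break;
        case INIT_CREATE_POLL_GROUPS:
            /* Async: messages go out, state is unchanged, loop exits. */
            break;
        case INIT_START_SUBSYSTEMS:
            t->state = RUNNING;             /* modeled as synchronous here */
            break;
        case RUNNING:
            break;
        }
    } while (t->state != prev);
}

/* Completion callback for async work: set the next state and resume. */
void tgt_async_done(struct tgt *t, enum tgt_state next)
{
    assert(next > t->state);                /* states only move forward */
    t->state = next;
    tgt_advance_state(t);
}
```

Calling `tgt_advance_state` once from INIT_NONE runs through two states and parks in INIT_CREATE_POLL_GROUPS; a later `tgt_async_done(&t, INIT_START_SUBSYSTEMS)` carries it to RUNNING.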
How nvmf_tgt learns about existing bdevs
Here's a thing that surprises everyone the first time. nvmf_tgt does not scan the system, load config, or build bdevs. It assumes the bdev framework has already done that.
The dependency is declared at the bottom of the file:
SPDK_SUBSYSTEM_DEPEND(nvmf, bdev). This tells the
framework "do not start me until bdev is up." By the time
nvmf_subsystem_init runs at
module/event/subsystems/nvmf/nvmf_tgt.c:848,
every bdev module has finished its module_init,
every bdev has been registered, and the bdev tree is queryable
via spdk_bdev_first / spdk_bdev_next.
When the framework hands nvmf_tgt a JSON-RPC request like
nvmf_subsystem_add_ns with
bdev_name = "lvstore-uuid/lvol-123", the subsystem
code calls
spdk_bdev_open_ext_v2 with that name. The bdev
framework looks up the bdev by name in its global registry.
No coordination with nvmf_tgt; the bdev was registered by
bdev_lvol (or bdev_nvme, or whoever)
back during the bdev subsystem's init_complete
phase.
Poll groups: one reactor per core, plus transport threads
nvmf_tgt is fundamentally a polling application. There is no
interrupt handler that wakes it up; the framework's reactor
thread per core runs a tight loop calling the transport's
poll_group_poll function. The piece of state that
ties a transport, a set of connections, and an SPDK thread
together is the poll group.
The thread count is the SPDK core count (or the size of
g_poll_groups_mask if one is configured). One
thread per core, named nvmf_tgt_poll_group_000,
nvmf_tgt_poll_group_001, etc. Each thread is a
regular spdk_thread, same as any other SPDK
thread: it still receives messages (that is how its poll group
gets created), but in steady state nearly all of its time goes
to running the transport poller in a tight loop.
The actual poll group is created on the new thread. Look at
nvmf_tgt_create_poll_group at
module/event/subsystems/nvmf/nvmf_tgt.c:189:
pg->thread = spdk_get_thread();
pg->group = spdk_nvmf_poll_group_create(g_spdk_nvmf_tgt);
spdk_thread_send_msg(g_tgt_init_thread, nvmf_tgt_create_poll_group_done, pg);
The thread is captured first, then the poll group is created
on the thread (so the poll group's internal
spdk_io_channel belongs to the right thread), then
a message is sent back to the init thread with the result. The
init thread's nvmf_tgt_create_poll_group_done
callback at line 165 increments a counter; when the count
matches the expected total, the state machine advances to
INIT_START_SUBSYSTEMS.
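The round trip between the init thread and the new threads can be modeled without SPDK at all. In the sketch below, `struct sim_thread`, `sim_send_msg`, and `sim_thread_poll` are hypothetical stand-ins for an SPDK thread and its message queue (a fixed-size FIFO, no locking), and `g_all_ready` stands in for the point where the state machine would advance.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_MSGS 16

typedef void (*msg_fn)(void *ctx);

/* Hypothetical stand-in for an SPDK thread: a FIFO of pending messages. */
struct sim_thread {
    msg_fn fn[MAX_MSGS];
    void *ctx[MAX_MSGS];
    size_t head, tail;
};

struct poll_group {
    struct sim_thread *thread;   /* thread that owns this group */
};

struct sim_thread g_init_thread;
struct sim_thread *g_current;    /* thread whose messages are running now */
int g_expected, g_created, g_all_ready;

void sim_send_msg(struct sim_thread *t, msg_fn fn, void *ctx)
{
    assert(t->tail < MAX_MSGS);  /* sketch only: the queue never wraps */
    t->fn[t->tail] = fn;
    t->ctx[t->tail] = ctx;
    t->tail++;
}

/* Run every queued message on thread t; returns how many ran. */
size_t sim_thread_poll(struct sim_thread *t)
{
    size_t ran = 0;

    g_current = t;
    while (t->head < t->tail) {
        size_t i = t->head++;
        t->fn[i](t->ctx[i]);
        ran++;
    }
    g_current = NULL;
    return ran;
}

/* Runs on the init thread: count completions. */
void create_poll_group_done(void *ctx)
{
    (void)ctx;
    if (++g_created == g_expected) {
        g_all_ready = 1;         /* the state machine would advance here */
    }
}

/* Runs on the new thread: record the owner, then report back. */
void create_poll_group(void *ctx)
{
    struct poll_group *pg = ctx;

    pg->thread = g_current;
    sim_send_msg(&g_init_thread, create_poll_group_done, pg);
}
```

Because the counter is only ever touched from the init thread's queue, it needs no lock; that is the same reason the real callback can use a plain counter.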
Connection accept and dispatch
Connections arrive on a transport listener. The transport's accept path is transport-specific (RDMA CM event, TCP accept, vfio-user connection), but the dispatch from there is uniform:
flowchart LR
  REMOTE["Remote host (kernel driver)"]
  T1["spdk_nvmf_transport.listen()<br/>sock or RDMA CM"]
  T2["spdk_nvmf_tgt_listen_ext()"]
  T3["spdk_nvmf_subsystem_add_listener_ext()"]
  ACCEPT["acceptor poller<br/>(per transport, per target)"]
  PICK["pick poll group<br/>(round-robin or get_optimal)"]
  QPAIR["spdk_nvmf_qpair created<br/>on chosen poll group's thread"]
  FABRIC["wait for Fabrics Connect capsule"]
  CTRL["spdk_nvmf_ctrlr created<br/>(if allowed)"]
  READY["qpair ready for I/O"]
  REMOTE --> T1
  T1 --> T2
  T2 --> T3
  T3 --> ACCEPT
  ACCEPT --> PICK
  PICK --> QPAIR
  QPAIR --> FABRIC
  FABRIC --> CTRL
  CTRL --> READY
fig. 2 The transport accepts a connection. The new qpair is attached to a poll group. The acceptor poller waits for a Fabrics Connect command. If the host NQN passes the allow-list check, a controller is created and the qpair is moved into the controller's qpair list.
The "pick poll group" step is where the acceptor hands the
connection off. The transport gets a vote: it implements
get_optimal_poll_group to suggest the best
candidate (often "the same core the NIC's IRQ is on"). The
framework falls back to round-robin if the transport doesn't
care. Look at the transport ops struct in
lib/nvmf/rdma.c:5383 for the RDMA
case.
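The selection rule (ask the transport first, fall back to round-robin) is easy to show in a standalone sketch. Everything here is hypothetical naming: `struct target`, `pick_poll_group`, and the `get_optimal_fn` hook only model the behavior described above and are not the library's actual types or signatures.

```c
#include <assert.h>
#include <stddef.h>

struct poll_group {
    int core;
};

/* Hypothetical transport hook: a preferred group, or NULL for "no preference". */
typedef struct poll_group *(*get_optimal_fn)(struct poll_group *groups,
                                             size_t n, int hint);

struct target {
    struct poll_group *groups;
    size_t num_groups;
    size_t next_rr;              /* round-robin cursor */
};

struct poll_group *pick_poll_group(struct target *t, get_optimal_fn get_optimal,
                                   int hint)
{
    struct poll_group *pg = NULL;

    assert(t->num_groups > 0);
    if (get_optimal != NULL) {
        pg = get_optimal(t->groups, t->num_groups, hint);
    }
    if (pg == NULL) {
        /* No opinion from the transport: take the next group in turn. */
        pg = &t->groups[t->next_rr];
        t->next_rr = (t->next_rr + 1) % t->num_groups;
    }
    return pg;
}

/* Example hook: prefer the group running on the core named by the hint. */
struct poll_group *optimal_by_core(struct poll_group *groups, size_t n, int hint)
{
    for (size_t i = 0; i < n; i++) {
        if (groups[i].core == hint) {
            return &groups[i];
        }
    }
    return NULL;
}
```

Note that a successful "optimal" pick does not move the round-robin cursor in this sketch; whether the real code does is not something this page states.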
Once the qpair is attached, it is polled continuously. The
transport's poll_group_poll checks for incoming
capsules; when one arrives, it dispatches to
nvmf_ctrlr_process_admin_cmd or
nvmf_ctrlr_process_io_cmd, depending on the
qpair's qid. Qid 0 is admin; everything else is I/O.
The shutdown state machine
Shutdown is the startup state machine in reverse, plus a few
extra states to drain in-flight I/O. The trigger is a SIGINT or
SIGTERM, which the app framework converts to a call
to nvmf_subsystem_fini at
module/event/subsystems/nvmf/nvmf_tgt.c:855.
nvmf_subsystem_fini calls
nvmf_shutdown_cb at line 83, which guards against
being called during startup:
if (g_tgt_state < NVMF_TGT_RUNNING) {
    spdk_thread_send_msg(spdk_get_thread(), nvmf_shutdown_cb, NULL);
    return;
}
If we're still initializing, defer: the callback re-queues itself
on the current thread and tries again on a later pass, until the
state machine actually reaches RUNNING. Once we're running,
transition to FINI_STOP_SUBSYSTEMS and start walking the
shutdown chain:
The reason for the elaborate sequence is that the states are
not all synchronous. spdk_nvmf_subsystem_stop has
to wait for in-flight I/O to drain; poll_group_destroy
has to wait for every qpair to be released; and so on. Each
state is followed by a callback that transitions to the next
state. The chain is on a single thread (the original init
thread, captured at the start of poll group creation), so
there's no race between state transitions.
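A compact model of the guard plus the callback-driven chain, assuming the state order shown in fig. 1. The names `request_shutdown` and `fini_step_done` are hypothetical, and each asynchronous step is reduced to "its completion callback fired"; the real states do actual teardown work before their callbacks run.

```c
#include <assert.h>
#include <stddef.h>

/* State order follows fig. 1; the enum itself is an assumption. */
enum tgt_state {
    INIT_NONE,
    INIT_CREATE_TARGET,
    INIT_CREATE_POLL_GROUPS,
    INIT_START_SUBSYSTEMS,
    RUNNING,
    FINI_STOP_SUBSYSTEMS,
    FINI_STOP_LISTEN,
    FINI_DESTROY_SUBSYSTEMS,
    FINI_DESTROY_POLL_GROUPS,
    FINI_DESTROY_TARGET,
    STOPPED,
};

struct tgt {
    enum tgt_state state;
};

/*
 * Returns 0 if startup has not finished (the caller should re-queue the
 * request and try again later), 1 once shutdown is under way.
 */
int request_shutdown(struct tgt *t)
{
    assert(t != NULL);
    if (t->state < RUNNING) {
        return 0;
    }
    if (t->state == RUNNING) {
        t->state = FINI_STOP_SUBSYSTEMS;
    }
    return 1;
}

/* Completion callback for the async work of the current FINI state. */
void fini_step_done(struct tgt *t)
{
    assert(t != NULL);
    if (t->state >= FINI_STOP_SUBSYSTEMS && t->state < STOPPED) {
        t->state = (enum tgt_state)(t->state + 1);
    }
}
```

Since both functions are only ever invoked from one thread's message queue, the state variable needs no locking, which matches the single-thread argument above.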
Edge cases & what trips people up
Hot config reload: JSON-RPC during a running target
You can call nvmf_create_subsystem while a target
is already up. The framework does not stop the world; the new
subsystem is created, started, and its listeners begin
accepting new connections. Existing connections are not
touched.
What you cannot do: mutate a running subsystem. Adding a
namespace to an active subsystem requires pausing it first
(spdk_nvmf_subsystem_pause), which blocks new
admin commands, drains I/O, then transitions to
PAUSED. The RPC layer for
nvmf_subsystem_add_ns does this for you — see
lib/nvmf/nvmf_rpc.c:1552. The pause
is per-subsystem and does not affect other subsystems on the
same target.
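The pause, mutate, resume sequence can be sketched as a tiny per-subsystem state machine. This is a model under stated assumptions, not the library code: the `subsys_*` functions are hypothetical, the calls are synchronous where the real ones are callback-driven, and "draining" is reduced to an in-flight counter reaching zero.

```c
#include <assert.h>
#include <errno.h>

enum subsys_state { SUBSYS_ACTIVE, SUBSYS_PAUSING, SUBSYS_PAUSED };

struct subsys {
    enum subsys_state state;
    int inflight_io;
    int num_ns;
};

/* Begin pausing; the subsystem is PAUSED only once in-flight I/O is gone. */
void subsys_pause(struct subsys *s)
{
    s->state = (s->inflight_io == 0) ? SUBSYS_PAUSED : SUBSYS_PAUSING;
}

/* One in-flight I/O finished; this may complete a pending pause. */
void subsys_io_done(struct subsys *s)
{
    assert(s->inflight_io > 0);
    s->inflight_io--;
    if (s->state == SUBSYS_PAUSING && s->inflight_io == 0) {
        s->state = SUBSYS_PAUSED;
    }
}

/* Namespaces may only be added while the subsystem is paused. */
int subsys_add_ns(struct subsys *s)
{
    if (s->state != SUBSYS_PAUSED) {
        return -EBUSY;
    }
    s->num_ns++;
    return 0;
}

void subsys_resume(struct subsys *s)
{
    assert(s->state == SUBSYS_PAUSED);
    s->state = SUBSYS_ACTIVE;
}
```

Each `struct subsys` carries its own state, so pausing one instance has no effect on any other, mirroring the per-subsystem scope of the real pause.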
Too many connections
Every connection is a file descriptor (TCP) or a QP (RDMA) or
a vfio-user endpoint, plus per-qpair memory. The transport's
max queue depth, max io qpairs per ctrlr, and the
max_qpairs_per_ctrlr field in
spdk_nvmf_transport_opts cap things. The actual
failure modes:
- RDMA: out of MR (memory region) entries in the HCA, or out of CQ (completion queue) entries. RDMA resource exhaustion is fatal; there's no way to ask the HCA for more. Tune max_srq_depth, data_wr_pool_size, and the HCA's max_mr / max_qp sysfs values.
- TCP: fd exhaustion. ulimit -n and the kernel's fs.nr_open. Set LimitNOFILE=infinity in the systemd unit, and tune the kernel.
- VFIO-user: no hard cap; the limit is process memory and number of open inotify watches.
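For the TCP case, a process can check its own descriptor budget with the POSIX `getrlimit` call. The helper below is a hypothetical addition (nothing like it is described as part of nvmf_tgt); it reports how much room is left under the soft RLIMIT_NOFILE limit for a given number of descriptors already in use.

```c
#include <assert.h>
#include <limits.h>
#include <sys/resource.h>

/*
 * Returns how many more descriptors the process may open before hitting
 * the soft RLIMIT_NOFILE, given the number currently in use.
 * Returns -1 if the limit cannot be read, LONG_MAX if it is unlimited.
 */
long fd_headroom(long fds_in_use)
{
    struct rlimit rl;
    long limit;

    assert(fds_in_use >= 0);
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        return -1;
    }
    if (rl.rlim_cur == RLIM_INFINITY) {
        return LONG_MAX;
    }
    limit = (long)rl.rlim_cur;
    return limit > fds_in_use ? limit - fds_in_use : 0;
}
```

A control plane could call something like this before admitting a burst of new TCP connections and refuse early instead of failing inside accept.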
A single connection going wild
One bad initiator submitting thousands of read I/Os per second can starve other connections on the same poll group. The nvmf_tgt thread model is per-core; if all your connections land on the same core (which is the default if you have one CPU socket and no NUMA awareness), the entire target's throughput can collapse.
Mitigations: pin poll groups to specific cores via
poll_groups_mask; use the per-connection
max_queue_depth to cap a single qpair's
outstanding I/O. The QoS rate limiter on the bdev layer
(covered in 4.2) can
also be used.
Shutdown ordering
The state machine is "polite" — it stops new I/O, drains
in-flight, then disconnects. But the application can also be
killed hard (SIGKILL). If a host has open I/O when nvmf_tgt
is SIGKILLed, none of the drain logic runs and no completion
is ever sent back for the commands still in flight.
This is a process-exit cleanup problem; the kernel sees the
TCP socket close, the RDMA QP go away, the vfio-user endpoint
vanish, and the host's I/O fails with
NVME_SC_HOST_PATH_ERROR.
The right way to shut down is SIGTERM, not SIGKILL. The
framework installs signal handlers that trigger
nvmf_subsystem_fini, which walks the
well-defined state machine and drains cleanly.
What to take away
nvmf_tgt is a small binary with a long state machine. Its job is to wire the framework's reactor model, the bdev stack, and the nvmf library's poll group model into a single coherent target. The state machine in fig. 1 has eleven states, five on the startup side (INIT_NONE through RUNNING) and six on the shutdown side (FINI_STOP_SUBSYSTEMS through STOPPED), and every transition is either synchronous (one call returns, advance the state) or asynchronous (one call's callback advances the state). Poll groups are per-core, named, and live for the life of the process.
The next page — Transports — is about the three ways the protocol actually moves bytes: RDMA, TCP, and VFIO-user. Each has its own transport struct, its own verbs or sockets, and its own failure modes.