From a Go SELECT to a kernel /dev/nvmeXnY.
The provisioning flow is the bread and butter of diskengine.
A row in the lvols table with state
CREATING gets picked up, an lvol is created on an
lvstore, the lvol becomes a namespace on a subsystem, the
subsystem listens on RDMA, and a remote host connects and sees
a new block device. This page traces every step with real
code.
- The provisioning loop, top of the file
- processProvisioning — fetching the work
- provisionLvol — the per-lvol driver
- ensureNvmeofReady — transport, subsystem, listener
- BdevLvolCreate → SPDK: rpc_bdev_lvol_create
- NvmfSubsystemAddNs → SPDK: rpc_nvmf_subsystem_add_ns
- Listener add: spdk_nvmf_subsystem_add_listener_ext
- Host side: kernel NVMe discover, kernel connect
- The whole flow, with a sequence diagram
- Edge cases: already-exists, partial cleanup, state drift
The provisioning loop, top of the file
The provisioning flow is driven by a goroutine that wakes
every second. It lives in
provisioningLoop:16. The job: scan the
database for lvols in state CREATING, do whatever
needs doing to make them visible over NVMe-oF, mark them
READY when done.
Three things happen each tick:
- processProvisioning: the actual work. Read below.
- verifyState: compare what the database says to what SPDK has, log drift. Drift is normal (it happens on restart when SPDK has lvols the database doesn't know about yet) but excessive drift is a bug.
- SaveConfig: write a JSON snapshot of bdev and nvmf state to SPDK_CONFIG_PATH. This is what makes the configuration survive an nvmf_tgt restart. The file is consumed by nvmf_tgt's -c flag at next startup.
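The shape of one tick can be sketched as follows. This is a minimal sketch, not diskengine's code: the tickSteps struct, the runTick helper, and the func() error signatures are hypothetical, and the choice to keep running later steps after an earlier one fails is an assumption.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// tickSteps bundles the three per-tick actions named in the text.
// The func() error signatures are an assumption made for this sketch.
type tickSteps struct {
	processProvisioning func() error
	verifyState         func() error
	saveConfig          func() error
}

// runTick runs the three steps in order and collects their errors.
// Assumption: a failing step is reported but does not skip the later ones.
func runTick(s tickSteps) []error {
	var errs []error
	for _, step := range []func() error{s.processProvisioning, s.verifyState, s.saveConfig} {
		if err := step(); err != nil {
			errs = append(errs, err)
		}
	}
	return errs
}

// provisioningLoop calls runTick once per interval until ctx is cancelled.
func provisioningLoop(ctx context.Context, interval time.Duration, s tickSteps) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			for _, err := range runTick(s) {
				fmt.Println("tick error:", err)
			}
		}
	}
}

func main() {
	steps := tickSteps{
		processProvisioning: func() error { fmt.Println("processProvisioning"); return nil },
		verifyState:         func() error { fmt.Println("verifyState"); return nil },
		saveConfig:          func() error { fmt.Println("SaveConfig"); return nil },
	}
	// Run briefly at a short interval, then stop via the context timeout.
	ctx, cancel := context.WithTimeout(context.Background(), 250*time.Millisecond)
	defer cancel()
	provisioningLoop(ctx, 100*time.Millisecond, steps)
}
```

In the real loop the interval would be one second. Keeping the per-tick body in its own function makes it possible to exercise the ordering without waiting on a timer.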
processProvisioning — fetching the work
The first thing processProvisioning does is query
the database. The query: "give me all lvols in state
CREATING for this bare metal node." The result is
a slice of repository.LvolToProcess.
The loop is "fail-soft" — a failure on one lvol does not stop the rest. This is important because the work involves multiple JSON-RPC calls to a separate process (nvmf_tgt), and any one of them can fail transiently.
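A fail-soft loop of this kind might look like the sketch below. The LvolToProcess type here is a one-field stand-in for repository.LvolToProcess, and processAll and errTransient are hypothetical names introduced for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// LvolToProcess is a stand-in for repository.LvolToProcess; the real
// struct carries more fields. Only an ID is needed for this sketch.
type LvolToProcess struct {
	LvolID int64
}

// errTransient is a placeholder for any transient JSON-RPC failure.
var errTransient = errors.New("transient RPC failure")

// processAll calls provision for every lvol and keeps going after a
// failure. It returns the IDs that failed; those lvols stay in CREATING
// and are picked up again on the next tick.
func processAll(lvols []LvolToProcess, provision func(LvolToProcess) error) []int64 {
	var failed []int64
	for _, l := range lvols {
		if err := provision(l); err != nil {
			fmt.Printf("lvol %d: provisioning failed, will retry: %v\n", l.LvolID, err)
			failed = append(failed, l.LvolID)
			continue // fail-soft: do not abort the rest of the batch
		}
	}
	return failed
}

func main() {
	lvols := []LvolToProcess{{LvolID: 1}, {LvolID: 2}, {LvolID: 3}}
	failed := processAll(lvols, func(l LvolToProcess) error {
		if l.LvolID == 2 {
			return errTransient
		}
		return nil
	})
	fmt.Println("failed:", failed)
}
```

The important detail is the continue: lvol 3 is still attempted even though lvol 2 failed.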
provisionLvol — the per-lvol driver
For each lvol, the work is a fixed sequence. The function
provisionLvol at
provisionLvol:74 is the orchestrator. It
doesn't do any SPDK work itself; it just calls the right
methods on spdkClient in the right order.
flowchart TB
    Start([Start: lvol in CREATING state]) --> Validate["Validate NQN, RDMA IP, port are set"]
    Validate --> EnsureReady["ensureNvmeofReady<br/>(transport + subsystem + listener)"]
    EnsureReady --> BdevCreate["BdevLvolCreate<br/>(creates the lvol)"]
    BdevCreate --> AddNs["NvmfSubsystemAddNs<br/>(attaches lvol to subsystem)"]
    AddNs --> Finalize["FinalizeProvisioningForLvol<br/>(mark lvol READY in DB)"]
    Finalize --> Done([End: lvol is READY])
fig. 1 The four steps that follow validation. The three SPDK
steps go to nvmf_tgt over JSON-RPC; the last one is a database
write. A failure in any step returns an error, and the next tick
retries the lvol, which is still in CREATING.
The validation step is a guard. An lvol in CREATING
state without placement information is a corrupted record; we
refuse to touch it. The error is logged and the lvol stays in
CREATING for human inspection.
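The guard and the fixed call order can be sketched like this. It is an illustration under assumptions: the lvolRecord fields, the spdkAPI interface, the method signatures, and errMissingPlacement are all hypothetical, and a fake client stands in for the real spdkClient.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// lvolRecord holds the fields the guard checks (hypothetical names).
type lvolRecord struct {
	LvolID  int64
	NQN     string
	RDMAIP  string
	Port    int
	SizeMiB uint64
}

// spdkAPI is the subset of client methods the driver calls (assumed signatures).
type spdkAPI interface {
	EnsureNvmeofReady(nqn, ip string, port int) error
	BdevLvolCreate(name string, sizeMiB uint64) (string, error)
	NvmfSubsystemAddNs(nqn, bdevName string) error
}

var errMissingPlacement = errors.New("lvol has no NQN, RDMA IP, or port")

// provisionLvol validates the record, then runs the steps in order.
// Any error is returned as-is or wrapped; the lvol stays in CREATING.
func provisionLvol(c spdkAPI, l lvolRecord, finalize func(lvolID int64) error) error {
	if l.NQN == "" || l.RDMAIP == "" || l.Port == 0 {
		return errMissingPlacement // corrupted record: refuse to touch it
	}
	if err := c.EnsureNvmeofReady(l.NQN, l.RDMAIP, l.Port); err != nil {
		return fmt.Errorf("ensure nvmeof ready: %w", err)
	}
	uuid, err := c.BdevLvolCreate(strconv.FormatInt(l.LvolID, 10), l.SizeMiB)
	if err != nil {
		return fmt.Errorf("create lvol: %w", err)
	}
	if err := c.NvmfSubsystemAddNs(l.NQN, uuid); err != nil {
		return fmt.Errorf("add namespace: %w", err)
	}
	return finalize(l.LvolID)
}

// fakeSPDK records call order so the sequence can be observed without a target.
type fakeSPDK struct{ calls []string }

func (f *fakeSPDK) EnsureNvmeofReady(nqn, ip string, port int) error {
	f.calls = append(f.calls, "ensureNvmeofReady")
	return nil
}

func (f *fakeSPDK) BdevLvolCreate(name string, sizeMiB uint64) (string, error) {
	f.calls = append(f.calls, "BdevLvolCreate")
	return "uuid-" + name, nil
}

func (f *fakeSPDK) NvmfSubsystemAddNs(nqn, bdevName string) error {
	f.calls = append(f.calls, "NvmfSubsystemAddNs")
	return nil
}

func main() {
	f := &fakeSPDK{}
	l := lvolRecord{LvolID: 1234, NQN: "nqn.2023-01.com.excloud:disk-serial-abc123",
		RDMAIP: "10.0.0.1", Port: 4420, SizeMiB: 1024}
	err := provisionLvol(f, l, func(id int64) error {
		f.calls = append(f.calls, "FinalizeProvisioningForLvol")
		return nil
	})
	fmt.Println(f.calls, err)
}
```

Because the driver only sequences calls, it can be tested against a fake like this one without a running nvmf_tgt.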
ensureNvmeofReady — transport, subsystem, listener
The ensureNvmeofReady function in
ensureNvmeofReady:13 is the
"idempotent setup" routine. It checks what's already in SPDK
and creates whatever's missing. The pattern: read state, do a
comparison, call the create only if needed.
The three things being ensured, in order:
- An RDMA transport. The nvmf_create_transport RPC creates the transport object inside the target. It's a one-time setup; once the transport exists, all subsystems can add listeners on it. The transport options (ioUnitSize = 16384, numShared = 1024, maxQueueDepth = 256, etc.) are tuned for diskengine's expected workload.
- A subsystem with the lvol's NQN. The nvmf_create_subsystem RPC creates the subsystem and starts it. Note the AllowAnyHost: true setting: diskengine doesn't enforce host NQN allowlists at this layer; it relies on the orchestrator to place authorized hosts. Production deployments would set this to false and use nvmf_subsystem_add_host explicitly.
- A listener on the right address. The nvmf_subsystem_add_listener RPC, with trtype = RDMA, the IP, and the port. This is what makes the subsystem actually reachable.
BdevLvolCreate → SPDK: rpc_bdev_lvol_create
With the subsystem ready, the next call creates the lvol on the lvstore. This is the call that produces the lvol bdev that the namespace will be attached to. The wrapper is at BdevLvolCreate:42.
The call sends a JSON-RPC request over the Unix socket. The
request is a struct with the lvstore UUID, the lvol name
(a numeric ID, derived from lvol.LvolID), the
size in MiB, the clear method, and a thin-provisioning flag.
The transport is JSON over a Unix socket — see
3.1 for the wire
format details.
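As a rough picture of what goes over the socket, the request could be built like the sketch below. The Go type and function names are hypothetical. The JSON field names (uuid, lvol_name, size_in_mib, clear_method, thin_provision) follow recent SPDK releases as far as I know; older releases take the size under a different field, so treat the field set as an assumption to check against the SPDK version in use.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// rpcRequest is a JSON-RPC 2.0 request envelope.
type rpcRequest struct {
	Version string      `json:"jsonrpc"`
	ID      int         `json:"id"`
	Method  string      `json:"method"`
	Params  interface{} `json:"params,omitempty"`
}

// bdevLvolCreateParams mirrors the fields described in the text.
// Field names are assumed to match the SPDK version in use.
type bdevLvolCreateParams struct {
	LvstoreUUID   string `json:"uuid"`
	LvolName      string `json:"lvol_name"`
	SizeInMiB     uint64 `json:"size_in_mib"`
	ClearMethod   string `json:"clear_method,omitempty"`
	ThinProvision bool   `json:"thin_provision"`
}

// encodeBdevLvolCreate builds the request body; the lvol name is the
// numeric lvol ID rendered as a decimal string.
func encodeBdevLvolCreate(id int, lvsUUID string, lvolID int64, sizeMiB uint64, clear string, thin bool) ([]byte, error) {
	return json.Marshal(rpcRequest{
		Version: "2.0",
		ID:      id,
		Method:  "bdev_lvol_create",
		Params: bdevLvolCreateParams{
			LvstoreUUID:   lvsUUID,
			LvolName:      strconv.FormatInt(lvolID, 10),
			SizeInMiB:     sizeMiB,
			ClearMethod:   clear,
			ThinProvision: thin,
		},
	})
}

func main() {
	// The lvstore UUID below is a made-up placeholder.
	body, err := encodeBdevLvolCreate(1, "11111111-2222-3333-4444-555555555555", 1234, 1024, "unmap", true)
	if err != nil {
		fmt.Println("encode error:", err)
		return
	}
	fmt.Println(string(body))
}
```

The encoded bytes would then be written to the Unix socket and the reply read back; that transport code is left out here.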
On the SPDK side, the RPC handler is
rpc_bdev_lvol_create in
module/bdev/lvol/vbdev_lvol_rpc.c:rpc_bdev_lvol_create.
The handler is in the lvol bdev module, not the nvmf library, but
it's invoked through the same JSON-RPC dispatch. The work
is: validate the params, find the lvstore, call
vbdev_lvol_create (which wraps spdk_lvol_create), and respond
with the new lvol's UUID.
The response back to diskengine is the lvol's UUID. This
UUID is what we'll attach to the subsystem in the next
call. The lvol name visible in SPDK is
<lvstore_name>/<lvol_name> — for
example, lvstore-0/1234.
NvmfSubsystemAddNs → SPDK: rpc_nvmf_subsystem_add_ns
The lvol exists. Now we expose it. The call is
NvmfSubsystemAddNs, with the subsystem's NQN
and the new lvol's bdev name (the UUID returned by the previous call).
The "already exists" branch is critical. It handles the
case where the previous attempt got partway through: the
lvol was created but the namespace attach failed, or the
process crashed between the two calls. On the next tick,
BdevLvolCreate would fail (lvol already
exists), and then NvmfSubsystemAddNs would
fail (namespace already exists). The
isAlreadyExistsErr helper detects this and
recovers by verifying the actual state.
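A recovery path of this kind might look like the sketch below. How the real isAlreadyExistsErr classifies errors is not shown in this text; the check for code -17 (negated EEXIST) or a message containing "exists" is an assumption, and rpcError and attachNamespace are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// rpcError models the error object of a JSON-RPC reply.
type rpcError struct {
	Code    int
	Message string
}

func (e *rpcError) Error() string {
	return fmt.Sprintf("rpc error %d: %s", e.Code, e.Message)
}

// isAlreadyExistsErr reports whether err looks like an "already exists" reply.
// Assumption: the target signals it with -17 (negated EEXIST) or a message
// containing "exists"; the real helper may use different criteria.
func isAlreadyExistsErr(err error) bool {
	var re *rpcError
	if !errors.As(err, &re) {
		return false
	}
	return re.Code == -17 || strings.Contains(strings.ToLower(re.Message), "exists")
}

// attachNamespace calls addNs and, on an "already exists" error, verifies
// the actual state instead of trusting the error text.
func attachNamespace(addNs func() error, isAttached func() (bool, error)) error {
	err := addNs()
	if err == nil {
		return nil
	}
	if !isAlreadyExistsErr(err) {
		return err
	}
	attached, verr := isAttached()
	if verr != nil {
		return fmt.Errorf("verify namespace: %w", verr)
	}
	if !attached {
		return fmt.Errorf("target reported already-exists but namespace is not attached: %w", err)
	}
	return nil // a previous tick already did the work
}

func main() {
	exists := &rpcError{Code: -17, Message: "File exists"}
	err := attachNamespace(
		func() error { return exists },
		func() (bool, error) { return true, nil },
	)
	fmt.Println("recovered:", err == nil)
}
```

The verification step matters: treating every "already exists" reply as success would hide the case where the error is reported but the namespace is not actually attached.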
On the SPDK side, the call goes to
rpc_nvmf_subsystem_add_ns at
lib/nvmf/nvmf_rpc.c:1552. The
handler validates that the subsystem is in
INACTIVE or PAUSED state, parses
the bdev name, then calls
spdk_nvmf_subsystem_add_ns_ext. That function
at
lib/nvmf/subsystem.c:2229 does
the real work: spdk_bdev_open_ext_v2 on the
bdev, allocate an NSID, populate the namespace struct.
Listener add: spdk_nvmf_subsystem_add_listener_ext
We've already covered this in
ensureNvmeofReady — but it's worth re-stating
the call chain, because this is what makes the subsystem
actually reachable.
The order is important. The subsystem is paused before the listener is added, and resumed after. This means: no new connections arrive while the listener is in flux, and existing connections are not touched. The pause is short — milliseconds — and is invisible to the host.
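Seen from the client, the request that triggers this chain could be built like the sketch below. The Go names are hypothetical; the JSON shape (nqn plus a nested listen_address with trtype, adrfam, traddr, trsvcid) follows the SPDK JSON-RPC documentation as I understand it and should be checked against the SPDK version in use.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// rpcRequest is a JSON-RPC 2.0 request envelope.
type rpcRequest struct {
	Version string      `json:"jsonrpc"`
	ID      int         `json:"id"`
	Method  string      `json:"method"`
	Params  interface{} `json:"params,omitempty"`
}

// listenAddress is the nested address object; trsvcid is a string on the wire.
type listenAddress struct {
	Trtype  string `json:"trtype"`
	Adrfam  string `json:"adrfam"`
	Traddr  string `json:"traddr"`
	Trsvcid string `json:"trsvcid"`
}

type addListenerParams struct {
	NQN           string        `json:"nqn"`
	ListenAddress listenAddress `json:"listen_address"`
}

// encodeAddListener builds an nvmf_subsystem_add_listener request for an
// RDMA listener on an IPv4 address.
func encodeAddListener(id int, nqn, ip string, port int) ([]byte, error) {
	return json.Marshal(rpcRequest{
		Version: "2.0",
		ID:      id,
		Method:  "nvmf_subsystem_add_listener",
		Params: addListenerParams{
			NQN: nqn,
			ListenAddress: listenAddress{
				Trtype:  "RDMA",
				Adrfam:  "IPv4",
				Traddr:  ip,
				Trsvcid: strconv.Itoa(port),
			},
		},
	})
}

func main() {
	body, err := encodeAddListener(7, "nqn.2023-01.com.excloud:disk-serial-abc123", "10.0.0.1", 4420)
	if err != nil {
		fmt.Println("encode error:", err)
		return
	}
	fmt.Println(string(body))
}
```

The pause and resume around the listener change happen inside the target; the client sends only this one request.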
Host side: kernel NVMe discover, kernel connect
Once the target is listening, the host can connect. The typical flow on a Linux host:
# Discover what's available
nvme discover -t rdma -a 10.0.0.1 -s 4420
# Output:
# Discovery Log Number of Records: 1, Generation counter: 1
# =====Discovery Log Entry 0======
# trtype: rdma
# adrfam: ipv4
# subtype: nvme subsystem
# treq: not specified, sq flow control disable supported
# portid: 0
# trsvcid: 4420
# subnqn: nqn.2023-01.com.excloud:disk-serial-abc123
# traddr: 10.0.0.1
# Connect to it
nvme connect -t rdma -a 10.0.0.1 -s 4420 -n nqn.2023-01.com.excloud:disk-serial-abc123
# See the new block device
ls /dev/nvme*
# /dev/nvme0n1
Three kernel-level things happen during the connect:
- RDMA connection setup. The kernel nvme_rdma driver allocates an RDMA queue pair, registers memory, and does the connect. This is the same machinery the SPDK RDMA transport responds to.
- Fabrics Connect command. The driver sends a Connect capsule on the admin queue. The target's nvmf_ctrlr_create at lib/nvmf/ctrlr.c:438 runs, allocates a controller, returns success.
- Namespace discovery. The driver sends Identify Namespace for NSIDs 1 through however many. For each, the target returns the namespace's identity data. The kernel creates /dev/nvme0n1 (or whatever the next slot is) for each namespace.
The whole flow, with a sequence diagram
sequenceDiagram
    participant DB as PostgreSQL
    participant DE as diskengine
    participant RPC as JSON-RPC socket
    participant TGT as nvmf_tgt
    participant FW as SPDK framework
    participant Bdev as bdev/lvol module
    participant Sub as lib/nvmf/subsystem
    participant Host as Linux host
    DB-->>DE: GetCreatingLvols()
    loop for each lvol
        DE->>RPC: nvmf_get_transports
        RPC->>TGT: JSON-RPC
        TGT-->>RPC: [no RDMA transport]
        DE->>RPC: nvmf_create_transport (RDMA)
        RPC->>TGT: JSON-RPC
        TGT->>FW: spdk_nvmf_transport_create
        FW-->>TGT: transport object
        TGT-->>RPC: success
        DE->>RPC: nvmf_get_subsystems
        RPC->>TGT: JSON-RPC
        TGT-->>RPC: [subsystem does not exist]
        DE->>RPC: nvmf_create_subsystem
        RPC->>TGT: JSON-RPC
        TGT->>Sub: spdk_nvmf_subsystem_create
        Sub-->>TGT: subsystem object
        TGT-->>RPC: success
        DE->>RPC: nvmf_subsystem_add_listener
        RPC->>TGT: JSON-RPC
        TGT->>Sub: spdk_nvmf_subsystem_add_listener_ext
        Sub->>TGT: listener attached, target listening on 10.0.0.1:4420
        TGT-->>RPC: success
        DE->>RPC: bdev_lvol_create
        RPC->>TGT: JSON-RPC
        TGT->>Bdev: spdk_lvol_create
        Bdev-->>TGT: lvol UUID
        TGT-->>RPC: lvol UUID
        DE->>RPC: nvmf_subsystem_add_ns
        RPC->>TGT: JSON-RPC
        TGT->>Sub: spdk_nvmf_subsystem_add_ns_ext
        Sub->>Bdev: spdk_bdev_open_ext_v2
        Sub-->>TGT: namespace attached
        TGT-->>RPC: NSID
        DE->>DB: FinalizeProvisioningForLvol
    end
    Host->>TGT: nvme connect -t rdma -a 10.0.0.1 -s 4420
    TGT->>Sub: nvmf_ctrlr_create
    Sub-->>TGT: controller allocated
    TGT-->>Host: connect OK, namespaces visible
    Host->>Host: /dev/nvme0n1 appears
fig. 2 The end-to-end flow. Each participant in the diagram is a separate actor doing real work; each arrow is a real function call or message. Notice the symmetry: the provisioning loop is one query, a run of ensure-or-create calls, and one finalize. The host side is a separate flow that runs only when an operator (or a higher-level orchestrator) issues the connect.
Edge cases & what trips people up
Already-exists recovery
The provisioning flow can fail mid-way. The most common
case: diskengine is restarted between
BdevLvolCreate and
NvmfSubsystemAddNs. The lvol exists in SPDK,
the namespace doesn't. Next tick, BdevLvolCreate
fails with "already exists." The code handles this with
findExistingLvolUUID at
findExistingLvolUUID:182, which
queries the bdev list for an lvol with the expected name
and returns its UUID.
The same pattern applies to NvmfSubsystemAddNs
via isNamespaceAttached at
isNamespaceAttached:159.
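The lookup by expected name can be sketched as below. The bdevInfo struct is a trimmed, assumed view of one entry in a bdev list reply (name, aliases, uuid), and the sample JSON is made up. The sketch matches the lvstore/lvol form against both the name and the aliases, since which of the two carries it may depend on the SPDK version.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// bdevInfo is a trimmed view of one bdev entry (assumed field set).
type bdevInfo struct {
	Name    string   `json:"name"`
	Aliases []string `json:"aliases"`
	UUID    string   `json:"uuid"`
}

// findExistingLvolUUID looks for a bdev known as <lvstore>/<lvol>, either
// by name or by alias, and returns its UUID.
func findExistingLvolUUID(bdevs []bdevInfo, lvstoreName, lvolName string) (string, bool) {
	want := lvstoreName + "/" + lvolName
	for _, b := range bdevs {
		if b.Name == want {
			return b.UUID, true
		}
		for _, a := range b.Aliases {
			if a == want {
				return b.UUID, true
			}
		}
	}
	return "", false
}

func main() {
	// Made-up sample shaped like a bdev list reply; the UUID is a placeholder.
	reply := `[
	  {"name": "Nvme0n1", "aliases": [], "uuid": "aaaaaaaa-0000-0000-0000-000000000000"},
	  {"name": "11111111-2222-3333-4444-555555555555",
	   "aliases": ["lvstore-0/1234"],
	   "uuid": "11111111-2222-3333-4444-555555555555"}
	]`
	var bdevs []bdevInfo
	if err := json.Unmarshal([]byte(reply), &bdevs); err != nil {
		fmt.Println("decode error:", err)
		return
	}
	uuid, ok := findExistingLvolUUID(bdevs, "lvstore-0", "1234")
	fmt.Println(uuid, ok)
}
```

With the UUID recovered this way, the flow can continue to the namespace attach as if the create had just succeeded.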
The lesson: every idempotent provisioning flow needs "already exists" handling. SPDK doesn't expose a "create if not exists" RPC; the client has to detect and handle the error.
State drift
The flow can fail in a way that leaves the database and SPDK
in disagreement. The most common case: an lvol is created in SPDK but
the database record is still in CREATING.
On restart, the next tick tries to provision the lvol
again. The bdev already exists, so the create returns
"already exists" and the code recovers. But the namespace
isn't attached, so the attach step runs. The flow
completes; the database gets marked READY.
The bad case: the namespace attach
partially fails. The bdev exists, the subsystem exists,
the listener exists, but the namespace is left in an
indeterminate state. The provisioning loop itself does not detect this;
verifyState logs it as drift, but doesn't
fix it automatically.
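At its simplest, a drift check is a two-way set difference between what the database lists and what SPDK reports. The sketch below shows that comparison only; diffState and its inputs are hypothetical, and the real verifyState may compare more than names.

```go
package main

import (
	"fmt"
	"sort"
)

// diffState compares lvol names known to the database with lvol names
// reported by SPDK. It returns names the database expects but SPDK lacks,
// and names SPDK has that the database does not know about.
func diffState(dbLvols, spdkLvols []string) (missingInSPDK, unknownToDB []string) {
	inDB := make(map[string]bool, len(dbLvols))
	for _, n := range dbLvols {
		inDB[n] = true
	}
	inSPDK := make(map[string]bool, len(spdkLvols))
	for _, n := range spdkLvols {
		inSPDK[n] = true
		if !inDB[n] {
			unknownToDB = append(unknownToDB, n)
		}
	}
	for _, n := range dbLvols {
		if !inSPDK[n] {
			missingInSPDK = append(missingInSPDK, n)
		}
	}
	sort.Strings(missingInSPDK)
	sort.Strings(unknownToDB)
	return missingInSPDK, unknownToDB
}

func main() {
	db := []string{"lvstore-0/1234", "lvstore-0/1235"}
	spdk := []string{"lvstore-0/1234", "lvstore-0/9999"}
	missing, unknown := diffState(db, spdk)
	fmt.Println("in DB but not in SPDK:", missing)
	fmt.Println("in SPDK but not in DB:", unknown)
}
```

A check like this only reports; deciding whether to repair a mismatch automatically is a separate policy question.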
The helpers.SaveConfig at the bottom of the
tick writes a JSON snapshot of the bdev and nvmf state to
SPDK_CONFIG_PATH. On nvmf_tgt restart, this
file is reloaded, and the bdevs/namespaces come back. So
the drift is "remembered" across restarts, but the
provisioning loop has to re-attach the namespace on the
next tick after the restart.
What if the host connects before the namespace is attached?
The host's nvme connect runs against the
target, the controller is created, and the host sends
Identify Namespace for NSIDs 1, 2, 3... If no
namespaces are attached, all of them return
"NSID not in use." The host sees no block devices under
/dev/nvme*. This is fine — the host just has
to wait for the next tick, or re-discover after the
namespace attach.
The protocol has no dedicated "namespace appeared" message, but it does have Asynchronous Event Notifications (AENs). A target can raise a Namespace Attribute Changed event, and a host with an outstanding Asynchronous Event Request rescans its namespace list in response. A host that does not act on the event picks the namespace up on its next rescan or reconnect. Either way there is a short natural latency between the attach and the block device appearing.
What if the disk's NQN changes?
The NQN is derived from the disk serial. If the disk is replaced, the serial changes, the NQN changes, the subsystem changes. The old subsystem is still in SPDK but has no lvols. The new subsystem is created. The host's connection to the old subsystem continues until the host's controller times out (keep-alive) or the host explicitly disconnects.
This is a feature, not a bug: the old subsystem is still
addressable until the host lets go of it. The cost is
stale subsystems cluttering up the target; an
nvmf_delete_subsystem RPC after
disconnect would be cleaner, but diskengine doesn't
do this automatically.
What to take away
The provisioning flow is a sequence of idempotent "ensure-or-create" calls, all of which go through the JSON-RPC socket to nvmf_tgt. The state machine on the diskengine side is "CREATING → READY," with the work happening on a per-tick loop. The state machine on the SPDK side is "subsystem doesn't exist → create → listener attached → namespace attached." Every step has an "already exists" recovery path. The host's connection is decoupled from the provisioning; once the subsystem is listening, hosts can connect (or reconnect) at their own pace.
This is the end of Layer 6. The next layer is vhost and virtio — the paravirtualized cousin of NVMe-oF for VMs that don't need full NVMe semantics.