Layer 6 · nvmf

From a Go SELECT to a kernel /dev/nvmeXnY.

The provisioning flow is the bread and butter of diskengine. A row in the lvols table with state CREATING gets picked up, an lvol is created on a lvstore, the lvol becomes a namespace on a subsystem, the subsystem listens on RDMA, and a remote host connects and sees a new block device. This page traces every step with real code.

~20 min read · 2 diagrams · prerequisites: 5.4 · 6.1 · 6.2 · 6.3
On this page
  1. The provisioning loop, top of the file
  2. processProvisioning — fetching the work
  3. provisionLvol — the per-lvol driver
  4. ensureNvmeofReady — transport, subsystem, listener
  5. BdevLvolCreate → SPDK: rpc_bdev_lvol_create
  6. NvmfSubsystemAddNs → SPDK: rpc_nvmf_subsystem_add_ns
  7. Listener add: spdk_nvmf_subsystem_add_listener_ext
  8. Host side: kernel NVMe discover, kernel connect
  9. The whole flow, with a sequence diagram
  10. Edge cases: already-exists, partial cleanup, state drift

The provisioning loop, top of the file

The provisioning flow is driven by a goroutine that wakes every second. It lives in provisioningLoop:16. The job: scan the database for lvols in state CREATING, do whatever needs doing to make them visible over NVMe-oF, and mark them READY when done.

Three things happen each tick:

  1. processProvisioning — the actual work. Read below.

  2. verifyState — compare what the database says to what SPDK has, log drift. Drift is normal (it happens on restart when SPDK has lvols the database doesn't know about yet) but excessive drift is a bug.

  3. SaveConfig — write a JSON snapshot of bdev and nvmf state to SPDK_CONFIG_PATH. This is what makes the configuration survive an nvmf_tgt restart. The file is consumed by nvmf_tgt's -c flag at next startup.
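The shape of one tick can be sketched as follows. This is a minimal illustration, not the diskengine source: the tickSteps type, the func() error signatures, the stop channel, and the choice to keep running later steps after an earlier one fails are all assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTransient is a placeholder failure used by the demo in main.
var errTransient = errors.New("transient RPC failure")

// tickSteps bundles the three per-tick operations. The field names mirror the
// steps described above; the signatures are assumptions for this sketch.
type tickSteps struct {
	processProvisioning func() error
	verifyState         func() error
	saveConfig          func() error
}

// runTick executes the three steps in order. A failing step is logged and the
// remaining steps still run. It returns the number of failed steps.
func runTick(s tickSteps) int {
	failed := 0
	for _, step := range []struct {
		name string
		fn   func() error
	}{
		{"processProvisioning", s.processProvisioning},
		{"verifyState", s.verifyState},
		{"saveConfig", s.saveConfig},
	} {
		if err := step.fn(); err != nil {
			fmt.Printf("tick: %s failed: %v\n", step.name, err)
			failed++
		}
	}
	return failed
}

// provisioningLoop calls runTick once per interval until stop is closed.
func provisioningLoop(stop <-chan struct{}, interval time.Duration, s tickSteps) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			runTick(s)
		}
	}
}

func main() {
	ok := func() error { return nil }
	flaky := func() error { return errTransient }
	stop := make(chan struct{})
	go func() {
		time.Sleep(35 * time.Millisecond)
		close(stop)
	}()
	// A short interval keeps the demo fast; the real loop wakes every second.
	provisioningLoop(stop, 10*time.Millisecond, tickSteps{ok, flaky, ok})
}
```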

processProvisioning — fetching the work

The first thing processProvisioning does is query the database. The query: "give me all lvols in state CREATING for this bare metal node." The result is a slice of repository.LvolToProcess.

The loop is "fail-soft" — a failure on one lvol does not stop the rest. This is important because the work involves multiple JSON-RPC calls to a separate process (nvmf_tgt), and any one of them can fail transiently.
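A fail-soft loop of this kind might look like the sketch below. LvolToProcess here is a stand-in with a single field, and processAll and its callback are hypothetical names chosen for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// LvolToProcess is a stand-in for repository.LvolToProcess; only the field
// this sketch needs is included.
type LvolToProcess struct {
	LvolID int64
}

// errRPC is a placeholder for a transient JSON-RPC failure.
var errRPC = errors.New("rpc timeout")

// processAll provisions every lvol and keeps going after a failure.
// It returns the IDs that failed. Those lvols stay in CREATING and are
// picked up again on the next tick.
func processAll(lvols []LvolToProcess, provision func(LvolToProcess) error) (failed []int64) {
	for _, l := range lvols {
		if err := provision(l); err != nil {
			fmt.Printf("lvol %d: provisioning failed, will retry: %v\n", l.LvolID, err)
			failed = append(failed, l.LvolID)
		}
	}
	return failed
}

func main() {
	lvols := []LvolToProcess{{LvolID: 1}, {LvolID: 2}, {LvolID: 3}}
	failed := processAll(lvols, func(l LvolToProcess) error {
		if l.LvolID == 2 {
			return errRPC // simulate one lvol hitting a transient error
		}
		return nil
	})
	fmt.Println("failed:", failed)
}
```

The point of returning the failed IDs instead of the first error is that lvol 3 is still processed even though lvol 2 failed.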

provisionLvol — the per-lvol driver

For each lvol, the work is a fixed sequence. The function provisionLvol at provisionLvol:74 is the orchestrator. It doesn't do any SPDK work itself; it just calls the right methods on spdkClient in the right order.

flowchart TB
  Start([Start: lvol in CREATING state]) --> Validate["Validate NQN, RDMA IP, port are set"]
  Validate --> EnsureReady["ensureNvmeofReady<br/>(transport + subsystem + listener)"]
  EnsureReady --> BdevCreate["BdevLvolCreate<br/>(creates the lvol)"]
  BdevCreate --> AddNs["NvmfSubsystemAddNs<br/>(attaches lvol to subsystem)"]
  AddNs --> Finalize["FinalizeProvisioningForLvol<br/>(mark lvol READY in DB)"]
  Finalize --> Done([End: lvol is READY])
fig. 1: provisionLvol, top-level sequence

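fig. 1   The four steps that follow validation. BdevLvolCreate and NvmfSubsystemAddNs are a single JSON-RPC call each; ensureNvmeofReady issues several, and the finalize step is a database write. A failure in any one returns an error; the next tick retries the sequence (the lvol is still in CREATING), and the idempotent steps that already succeeded are effectively skipped.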
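The stop-at-first-error behaviour in the figure can be sketched like this. The step type and runSteps are hypothetical names for the example; the real provisionLvol calls methods on spdkClient directly.

```go
package main

import (
	"errors"
	"fmt"
)

// step pairs a name (used in error messages) with the work to do.
type step struct {
	name string
	run  func() error
}

// errBusy is a placeholder for a transient target-side failure.
var errBusy = errors.New("target busy")

// runSteps executes steps in order and stops at the first failure,
// wrapping the error with the name of the step that failed.
func runSteps(steps []step) error {
	for _, s := range steps {
		if err := s.run(); err != nil {
			return fmt.Errorf("%s: %w", s.name, err)
		}
	}
	return nil
}

func main() {
	attempts := 0
	noop := func() error { return nil }
	steps := []step{
		{"ensureNvmeofReady", noop},
		{"BdevLvolCreate", func() error {
			attempts++
			if attempts == 1 {
				return errBusy // fail only on the first tick
			}
			return nil
		}},
		{"NvmfSubsystemAddNs", noop},
		{"FinalizeProvisioningForLvol", noop},
	}
	// First tick stops at BdevLvolCreate; the lvol would stay in CREATING.
	fmt.Println("tick 1:", runSteps(steps))
	// Second tick reruns the whole sequence and completes.
	fmt.Println("tick 2:", runSteps(steps))
}
```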

The validation step is a guard. An lvol in CREATING state without placement information is a corrupted record; we refuse to touch it. The error is logged and the lvol stays in CREATING for human inspection.
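A guard along these lines would do the job. The placement struct, its field names, and the exact checks (a parseable IP, a port in the 1 to 65535 range) are assumptions for this sketch; the text above only says that NQN, RDMA IP, and port must be set.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// placement holds the fields the guard checks. The names are hypothetical.
type placement struct {
	NQN    string
	RDMAIP string
	Port   int
}

// errIncompletePlacement marks a record that should be left for inspection.
var errIncompletePlacement = errors.New("incomplete placement")

// validatePlacement returns nil only when all three fields are usable.
func validatePlacement(p placement) error {
	switch {
	case p.NQN == "":
		return fmt.Errorf("%w: NQN is empty", errIncompletePlacement)
	case net.ParseIP(p.RDMAIP) == nil:
		return fmt.Errorf("%w: RDMA IP %q is not an IP address", errIncompletePlacement, p.RDMAIP)
	case p.Port < 1 || p.Port > 65535:
		return fmt.Errorf("%w: port %d is out of range", errIncompletePlacement, p.Port)
	}
	return nil
}

func main() {
	good := placement{NQN: "nqn.2023-01.com.excloud:disk-serial-abc123", RDMAIP: "10.0.0.1", Port: 4420}
	bad := placement{NQN: good.NQN, Port: 4420} // RDMA IP was never filled in
	fmt.Println(validatePlacement(good))
	fmt.Println(validatePlacement(bad))
}
```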

ensureNvmeofReady — transport, subsystem, listener

The ensureNvmeofReady function in ensureNvmeofReady:13 is the "idempotent setup" routine. It checks what's already in SPDK and creates whatever's missing. The pattern: read state, do a comparison, call the create only if needed.

The three things being ensured, in order:

  1. An RDMA transport. The nvmf_create_transport RPC creates the transport object inside the target. It's a one-time setup; once the transport exists, all subsystems can add listeners on it. The transport options (ioUnitSize = 16384, numShared = 1024, maxQueueDepth = 256, etc.) are tuned for diskengine's expected workload.

  2. A subsystem with the lvol's NQN. The nvmf_create_subsystem RPC creates the subsystem and starts it. Note the AllowAnyHost: true — diskengine doesn't enforce host NQN allowlists at this layer; it relies on the orchestrator to place authorized hosts. Production deployments would set this to false and use nvmf_subsystem_add_host explicitly.

  3. A listener on the right address. The nvmf_subsystem_add_listener RPC, with trtype = RDMA, the IP, and the port. This is what makes the subsystem actually reachable.
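The read, compare, create-if-missing pattern can be sketched against a small interface. The target interface and its method names are hypothetical, as is the in-memory fakeTarget; the comments name the SPDK RPC each method would plausibly wrap.

```go
package main

import "fmt"

// target is the part of the SPDK client this sketch needs. The method names
// are hypothetical; the comments name the RPC each one would wrap.
type target interface {
	HasTransport(trtype string) (bool, error)   // nvmf_get_transports
	CreateTransport(trtype string) error        // nvmf_create_transport
	HasSubsystem(nqn string) (bool, error)      // nvmf_get_subsystems
	CreateSubsystem(nqn string) error           // nvmf_create_subsystem
	HasListener(nqn, addr string) (bool, error) // listen_addresses in nvmf_get_subsystems
	AddListener(nqn, addr string) error         // nvmf_subsystem_add_listener
}

// ensure runs create only when exists reports false.
// It returns true when it actually created something.
func ensure(exists func() (bool, error), create func() error) (bool, error) {
	ok, err := exists()
	if err != nil || ok {
		return false, err
	}
	if err := create(); err != nil {
		return false, err
	}
	return true, nil
}

// ensureNvmeofReady ensures transport, subsystem, and listener, in that order.
// It returns how many objects it had to create.
func ensureNvmeofReady(t target, nqn, addr string) (int, error) {
	checks := []struct {
		exists func() (bool, error)
		create func() error
	}{
		{func() (bool, error) { return t.HasTransport("RDMA") }, func() error { return t.CreateTransport("RDMA") }},
		{func() (bool, error) { return t.HasSubsystem(nqn) }, func() error { return t.CreateSubsystem(nqn) }},
		{func() (bool, error) { return t.HasListener(nqn, addr) }, func() error { return t.AddListener(nqn, addr) }},
	}
	created := 0
	for _, c := range checks {
		did, err := ensure(c.exists, c.create)
		if err != nil {
			return created, err
		}
		if did {
			created++
		}
	}
	return created, nil
}

// fakeTarget is an in-memory stand-in for nvmf_tgt, used only for the demo.
type fakeTarget struct{ objects map[string]bool }

func (f *fakeTarget) HasTransport(tr string) (bool, error) { return f.objects["t:"+tr], nil }
func (f *fakeTarget) CreateTransport(tr string) error      { f.objects["t:"+tr] = true; return nil }
func (f *fakeTarget) HasSubsystem(n string) (bool, error)  { return f.objects["s:"+n], nil }
func (f *fakeTarget) CreateSubsystem(n string) error       { f.objects["s:"+n] = true; return nil }
func (f *fakeTarget) HasListener(n, a string) (bool, error) {
	return f.objects["l:"+n+"@"+a], nil
}
func (f *fakeTarget) AddListener(n, a string) error { f.objects["l:"+n+"@"+a] = true; return nil }

func main() {
	t := &fakeTarget{objects: map[string]bool{}}
	nqn := "nqn.2023-01.com.excloud:disk-serial-abc123"
	first, _ := ensureNvmeofReady(t, nqn, "10.0.0.1:4420")
	second, _ := ensureNvmeofReady(t, nqn, "10.0.0.1:4420")
	fmt.Println(first, second) // the second call finds everything in place
}
```

Calling the routine twice against the same target is the quickest way to check idempotence: the first call creates three objects, the second creates none.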

BdevLvolCreate → SPDK: rpc_bdev_lvol_create

With the subsystem ready, the next call creates the lvol on the lvstore. This is the call that produces the lvol bdev that the namespace will be attached to. The wrapper is at BdevLvolCreate:42.

The call sends a JSON-RPC request over the Unix socket (see 3.1 for the wire format details). The request is a struct with the lvstore UUID, the lvol name (a numeric ID, derived from lvol.LvolID), the size in MiB, the clear method, and a thin-provisioning flag.
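As an illustration, the request could be built like this. The Go type and function names are hypothetical. The JSON parameter names (uuid, lvol_name, size_in_mib, clear_method, thin_provision) follow recent upstream SPDK as far as I know; older releases took the size in bytes, so check the version you run. The clear method and thin-provisioning values are hard-coded only to keep the sketch short.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// rpcRequest is a JSON-RPC 2.0 request envelope.
type rpcRequest struct {
	JSONRPC string      `json:"jsonrpc"`
	ID      int         `json:"id"`
	Method  string      `json:"method"`
	Params  interface{} `json:"params,omitempty"`
}

// bdevLvolCreateParams carries the five fields described above.
type bdevLvolCreateParams struct {
	UUID          string `json:"uuid"` // UUID of the lvstore, not of the new lvol
	LvolName      string `json:"lvol_name"`
	SizeInMiB     uint64 `json:"size_in_mib"`
	ClearMethod   string `json:"clear_method"`
	ThinProvision bool   `json:"thin_provision"`
}

// buildBdevLvolCreate serializes one bdev_lvol_create request.
// The lvol name is the decimal form of the numeric lvol ID.
func buildBdevLvolCreate(id int, lvsUUID string, lvolID int64, sizeMiB uint64) ([]byte, error) {
	return json.Marshal(rpcRequest{
		JSONRPC: "2.0",
		ID:      id,
		Method:  "bdev_lvol_create",
		Params: bdevLvolCreateParams{
			UUID:          lvsUUID,
			LvolName:      strconv.FormatInt(lvolID, 10),
			SizeInMiB:     sizeMiB,
			ClearMethod:   "unmap", // assumption for the sketch
			ThinProvision: true,    // assumption for the sketch
		},
	})
}

func main() {
	b, err := buildBdevLvolCreate(1, "a9959197-b5e2-4f2d-8095-251ffb6985a5", 1234, 1024)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```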

On the SPDK side, the RPC handler is rpc_bdev_lvol_create, which in upstream SPDK lives in module/bdev/lvol/vbdev_lvol_rpc.c. The handler is in the lvol bdev module, not the nvmf library, but it's invoked through the same JSON-RPC dispatch. The work is: validate the params, find the lvstore, create the lvol (vbdev_lvol_create, which wraps spdk_lvol_create), and respond with the new lvol's UUID.

The response back to diskengine is the lvol's UUID. This UUID is what we'll attach to the subsystem in the next call. The lvol name visible in SPDK is <lvstore_name>/<lvol_name> — for example, lvstore-0/1234.

NvmfSubsystemAddNs → SPDK: rpc_nvmf_subsystem_add_ns

The lvol exists. Now we expose it. The call is NvmfSubsystemAddNs, with the subsystem's NQN and the new lvol's bdev name (the UUID returned by the create call).

The "already exists" branch is critical. It handles the cases where a previous attempt got partway through: the lvol was created but the namespace attach failed, the process crashed between the two calls, or it crashed after the attach but before the database was updated. On the next tick, BdevLvolCreate fails (lvol already exists) and, if the namespace had been attached before the crash, NvmfSubsystemAddNs fails too (namespace already exists). The isAlreadyExistsErr helper detects these errors, and the code recovers by verifying the actual state.
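One way to structure that recovery is sketched below. How the real isAlreadyExistsErr classifies errors is not shown on this page, so the rule used here (error code -17, or a message containing "exists") is an assumption, as are the rpcError type and the attachNamespace helper.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// rpcError models a JSON-RPC error object returned by the target.
type rpcError struct {
	Code    int
	Message string
}

func (e *rpcError) Error() string { return fmt.Sprintf("rpc error %d: %s", e.Code, e.Message) }

// isAlreadyExistsErr reports whether err looks like an "already exists"
// failure. The matching rule is an assumption for this sketch.
func isAlreadyExistsErr(err error) bool {
	var re *rpcError
	if !errors.As(err, &re) {
		return false
	}
	return re.Code == -17 || strings.Contains(strings.ToLower(re.Message), "exists")
}

// attachNamespace tries the attach and, on an "already exists" error, trusts
// the target's actual state rather than the error text.
func attachNamespace(add func() error, isAttached func() (bool, error)) error {
	err := add()
	if err == nil {
		return nil
	}
	if !isAlreadyExistsErr(err) {
		return err
	}
	ok, verr := isAttached()
	if verr != nil {
		return verr
	}
	if !ok {
		return fmt.Errorf("namespace reported as existing but is not attached: %w", err)
	}
	return nil // a previous attempt already did the work
}

func main() {
	exists := func() error { return &rpcError{Code: -17, Message: "File exists"} }
	attached := func() (bool, error) { return true, nil }
	fmt.Println(attachNamespace(exists, attached))
}
```

The verification step matters because an error string alone does not prove the namespace is in the expected state; the recovery only returns success after the target confirms it.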

On the SPDK side, the call goes to rpc_nvmf_subsystem_add_ns at lib/nvmf/nvmf_rpc.c:1552. The handler parses the parameters, pauses the subsystem, and calls spdk_nvmf_subsystem_add_ns_ext from the pause callback before resuming. That function, at lib/nvmf/subsystem.c:2229, requires the subsystem to be in INACTIVE or PAUSED state and does the real work: spdk_bdev_open_ext_v2 on the bdev, allocate an NSID, populate the namespace struct.

Listener add: spdk_nvmf_subsystem_add_listener_ext

We've already covered this in ensureNvmeofReady — but it's worth re-stating the call chain, because this is what makes the subsystem actually reachable.

The order is important. The subsystem is paused before the listener is added, and resumed after. This means: no new connections arrive while the listener is in flux, and existing connections are not touched. The pause is short — milliseconds — and is invisible to the host.

Host side: kernel NVMe discover, kernel connect

Once the target is listening, the host can connect. The typical flow on a Linux host:

# Discover what's available
nvme discover -t rdma -a 10.0.0.1 -s 4420

# Output:
# Discovery Log Number of Records: 1, Generation counter: 1
# =====Discovery Log Entry 0======
# trtype:  rdma
# adrfam:  ipv4
# subtype: nvme subsystem
# treq:    not specified, sq flow control disable supported
# portid:  0
# trsvcid: 4420
# subnqn:  nqn.2023-01.com.excloud:disk-serial-abc123
# traddr:  10.0.0.1

# Connect to it
nvme connect -t rdma -a 10.0.0.1 -s 4420 -n nqn.2023-01.com.excloud:disk-serial-abc123

# See the new block device
ls /dev/nvme*
# /dev/nvme0n1

Three kernel-level things happen during the connect:

  1. RDMA connection setup. The kernel nvme_rdma driver allocates an RDMA queue pair, registers memory, and does the connect. This is the same machinery the SPDK RDMA transport responds to.

  2. Fabrics Connect command. The driver sends a Connect capsule on the admin queue. The target's nvmf_ctrlr_create at lib/nvmf/ctrlr.c:438 runs, allocates a controller, returns success.

  3. Namespace discovery. The driver asks the controller for its active namespace list (Identify, CNS 02h), then sends Identify Namespace for each NSID in that list. For each, the target returns the namespace's identity data. The kernel creates /dev/nvme0n1 (or whatever the next slot is) for each namespace.
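Under the hood, nvme connect hands the kernel a comma-separated option string by writing it to /dev/nvme-fabrics. The sketch below only builds that string for the connect shown above; the connectArgs type and fabricsOptions function are hypothetical names, and nothing is written to the device.

```go
package main

import (
	"fmt"
	"strings"
)

// connectArgs mirrors the flags passed to nvme connect above.
type connectArgs struct {
	Transport string // -t
	Traddr    string // -a
	Trsvcid   int    // -s
	SubNQN    string // -n
	HostNQN   string // optional; the kernel falls back to a default when empty
}

// fabricsOptions renders the key=value list the fabrics driver parses.
func fabricsOptions(a connectArgs) string {
	opts := []string{
		"transport=" + a.Transport,
		"traddr=" + a.Traddr,
		fmt.Sprintf("trsvcid=%d", a.Trsvcid),
		"nqn=" + a.SubNQN,
	}
	if a.HostNQN != "" {
		opts = append(opts, "hostnqn="+a.HostNQN)
	}
	return strings.Join(opts, ",")
}

func main() {
	fmt.Println(fabricsOptions(connectArgs{
		Transport: "rdma",
		Traddr:    "10.0.0.1",
		Trsvcid:   4420,
		SubNQN:    "nqn.2023-01.com.excloud:disk-serial-abc123",
	}))
}
```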

The whole flow, with a sequence diagram

sequenceDiagram
participant DB as PostgreSQL
participant DE as diskengine
participant RPC as JSON-RPC socket
participant TGT as nvmf_tgt
participant FW as SPDK framework
participant Bdev as bdev/lvol module
participant Sub as lib/nvmf/subsystem
participant Host as Linux host

DE->>DB: GetCreatingLvols()
DB-->>DE: lvols in CREATING
loop for each lvol
  DE->>RPC: nvmf_get_transports
  RPC->>TGT: JSON-RPC
  TGT-->>RPC: [no RDMA transport]
  DE->>RPC: nvmf_create_transport (RDMA)
  RPC->>TGT: JSON-RPC
  TGT->>FW: spdk_nvmf_transport_create
  FW-->>TGT: transport object
  TGT-->>RPC: success
  DE->>RPC: nvmf_get_subsystems
  RPC->>TGT: JSON-RPC
  TGT-->>RPC: [subsystem does not exist]
  DE->>RPC: nvmf_create_subsystem
  RPC->>TGT: JSON-RPC
  TGT->>Sub: spdk_nvmf_subsystem_create
  Sub-->>TGT: subsystem object
  TGT-->>RPC: success
  DE->>RPC: nvmf_subsystem_add_listener
  RPC->>TGT: JSON-RPC
  TGT->>Sub: spdk_nvmf_subsystem_add_listener_ext
  Sub-->>TGT: listener attached, target listening on 10.0.0.1:4420
  TGT-->>RPC: success
  DE->>RPC: bdev_lvol_create
  RPC->>TGT: JSON-RPC
  TGT->>Bdev: spdk_lvol_create
  Bdev-->>TGT: lvol UUID
  TGT-->>RPC: lvol UUID
  DE->>RPC: nvmf_subsystem_add_ns
  RPC->>TGT: JSON-RPC
  TGT->>Sub: spdk_nvmf_subsystem_add_ns_ext
  Sub->>Bdev: spdk_bdev_open_ext_v2
  Sub-->>TGT: namespace attached
  TGT-->>RPC: NSID
  DE->>DB: FinalizeProvisioningForLvol
end
Host->>TGT: nvme connect -t rdma -a 10.0.0.1 -s 4420
TGT->>Sub: nvmf_ctrlr_create
Sub-->>TGT: controller allocated
TGT-->>Host: connect OK, namespaces visible
Host->>Host: /dev/nvme0n1 appears
fig. 2: the end-to-end provisioning flow

fig. 2   The end-to-end flow. Each column in the diagram is a separate actor doing real work; each arrow is a real function call or message. Notice the shape: the provisioning loop is one query, two read-only checks, five create calls, and one finalize. The host side is a separate flow that runs only when an operator (or a higher-level orchestrator) issues the connect.

Edge cases & what trips people up

Already-exists recovery

The provisioning flow can fail mid-way. The most common case: diskengine is restarted between BdevLvolCreate and NvmfSubsystemAddNs. The lvol exists in SPDK, the namespace doesn't. Next tick, BdevLvolCreate fails with "already exists." The code handles this with findExistingLvolUUID at findExistingLvolUUID:182, which queries the bdev list for an lvol with the expected name and returns its UUID.
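A lookup of that kind might be written as below. The bdevInfo struct keeps only three fields from the bdev list output, and the assumption that the lvstore/lvol form shows up either as the bdev name or among its aliases is mine; the signature is hypothetical and takes the already-fetched list rather than making the RPC itself.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// bdevInfo keeps the fields of a bdev list entry that the lookup needs.
type bdevInfo struct {
	Name    string   `json:"name"`
	Aliases []string `json:"aliases"`
	UUID    string   `json:"uuid"`
}

// findExistingLvolUUID searches the bdev list for "<lvstore>/<lvol>" as a
// name or alias and returns the matching bdev's UUID.
func findExistingLvolUUID(bdevs []bdevInfo, lvstore, lvolName string) (string, bool) {
	want := lvstore + "/" + lvolName
	for _, b := range bdevs {
		if b.Name == want {
			return b.UUID, true
		}
		for _, a := range b.Aliases {
			if a == want {
				return b.UUID, true
			}
		}
	}
	return "", false
}

func main() {
	// Illustrative payload, not captured from a real target.
	raw := `[{"name":"5f2a","aliases":["lvstore-0/1234"],"uuid":"5f2a"},
	         {"name":"Nvme0n1","aliases":[],"uuid":"9c11"}]`
	var bdevs []bdevInfo
	if err := json.Unmarshal([]byte(raw), &bdevs); err != nil {
		panic(err)
	}
	uuid, ok := findExistingLvolUUID(bdevs, "lvstore-0", "1234")
	fmt.Println(uuid, ok)
}
```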

The same pattern applies to NvmfSubsystemAddNs via isNamespaceAttached at isNamespaceAttached:159.

The lesson: every idempotent provisioning flow needs "already exists" handling. SPDK doesn't expose a "create if not exists" RPC; the client has to detect and handle the error.

State drift

The flow can fail in a way that the database and SPDK disagree. The most common: an lvol is created in SPDK but the database record is still in CREATING. On restart, the next tick tries to provision the lvol again. The bdev already exists, so the create returns "already exists" and the code recovers. But the namespace isn't attached, so the attach step runs. The flow completes; the database gets marked READY.

The bad case: the namespace attach partially fails. The bdev exists, the subsystem exists, the listener exists, but the namespace is left in an inconsistent state. The provisioning loop has no way to detect this; verifyState logs it as drift, but doesn't repair it.
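The comparison at the heart of a drift check is a two-way set difference. The sketch below assumes both sides can be reduced to a list of lvol names; diffState is a hypothetical helper, not the real verifyState.

```go
package main

import (
	"fmt"
	"sort"
)

// diffState compares the lvol names the database expects with the names the
// target reports. Both result slices are sorted for stable log output.
func diffState(dbLvols, spdkLvols []string) (missingInSPDK, unknownToDB []string) {
	inSPDK := make(map[string]bool, len(spdkLvols))
	for _, n := range spdkLvols {
		inSPDK[n] = true
	}
	inDB := make(map[string]bool, len(dbLvols))
	for _, n := range dbLvols {
		inDB[n] = true
		if !inSPDK[n] {
			missingInSPDK = append(missingInSPDK, n)
		}
	}
	for _, n := range spdkLvols {
		if !inDB[n] {
			unknownToDB = append(unknownToDB, n)
		}
	}
	sort.Strings(missingInSPDK)
	sort.Strings(unknownToDB)
	return missingInSPDK, unknownToDB
}

func main() {
	db := []string{"lvstore-0/1234", "lvstore-0/1235"}
	spdk := []string{"lvstore-0/1234", "lvstore-0/9000"}
	missing, unknown := diffState(db, spdk)
	fmt.Println("in DB but not in SPDK:", missing)
	fmt.Println("in SPDK but not in DB:", unknown)
}
```

A name-level diff like this cannot see the partial-attach case described above, which is one reason that case is logged at best and never repaired.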

The helpers.SaveConfig at the bottom of the tick writes a JSON snapshot of the bdev and nvmf state to SPDK_CONFIG_PATH. On nvmf_tgt restart, this file is reloaded, and the bdevs/namespaces come back. So the drift is "remembered" across restarts, but the provisioning loop has to re-attach the namespace on the next tick after the restart.

What if the host connects before the namespace is attached?

The host's nvme connect runs against the target, the controller is created, and the host asks for the list of active namespaces. If no namespaces are attached, the list is empty. The host gets a controller device (for example /dev/nvme0) but no namespace block devices such as /dev/nvme0n1. This is fine: the block device appears once the provisioning loop has attached the namespace and the host has rescanned.

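The protocol has a notification for this. The host keeps Asynchronous Event Requests outstanding on the admin queue, and when a namespace is attached the target can complete one with a Namespace Attribute Changed notice (an AEN). The Linux driver responds by rescanning the controller's namespaces, so the block device normally appears shortly after the attach without a reconnect. If the event is missed, running nvme ns-rescan on the controller device forces the rescan.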

What if the disk's NQN changes?

The NQN is derived from the disk serial. If the disk is replaced, the serial changes, the NQN changes, the subsystem changes. The old subsystem is still in SPDK but has no lvols. The new subsystem is created. The host's connection to the old subsystem continues until the host's controller times out (keep-alive) or the host explicitly disconnects.

This is a feature, not a bug: the old subsystem stays reachable until the host lets go of it. The cost is stale subsystems cluttering up the target; an nvmf_delete_subsystem RPC after disconnect would be cleaner, but diskengine doesn't do this automatically.

What to take away

The provisioning flow is a sequence of idempotent "ensure-or-create" calls, all of which go through the JSON-RPC socket to nvmf_tgt. The state machine on the diskengine side is "CREATING → READY," with the work happening on a per-tick loop. The state machine on the SPDK side is "subsystem doesn't exist → create → listener attached → namespace attached." Every step has an "already exists" recovery path. The host's connection is decoupled from the provisioning; once the subsystem is listening, hosts can connect (or reconnect) at their own pace.

This is the end of Layer 6. The next layer is vhost and virtio — the paravirtualized cousin of NVMe-oF for VMs that don't need full NVMe semantics.