Layer 5 · lvol

What diskengine actually does when it touches an lvol.

This is the marquee page of Layer 5. We trace three diskengine loops that touch lvols, end to end, from the outermost Go tick to the innermost C function. The loops are: provisionLvol (creating a new lvol), processInPlaceResizes (growing an existing lvol), and processDeletingSnapshots (deleting a snapshot and reconciling the lvstore's free-cluster count). At the end of each trace you'll know exactly which bytes move on the wire, which C function runs, and which DB write finalizes the operation.

~20 min read · 1 sequence diagram · prerequisites: 5.1 lvstore · 5.2 thin + snapshots · 5.3 CoW · 3.2 bdev_lvol_create end-to-end
On this page
  1. The orchestrator: provisioningLoop ticks every second
  2. Trace 1: processProvisioning
  3. Trace 1: provisionLvol
  4. Trace 1: ensureNvmeofReady
  5. Trace 1: BdevLvolCreate — Go to C
  6. Trace 1: rpc_bdev_lvol_create — the C handler
  7. Trace 1: vbdev_lvol_create — into the lvol layer
  8. Trace 1: the cluster allocation and metadata write
  9. Trace 1: NvmfSubsystemAddNs — the response path
  10. Trace 1: verifyState does a parallel check
  11. Trace 2: processInPlaceResizes
  12. Trace 3: processDeletingSnapshots
  13. Edge cases & what trips people up

The orchestrator: provisioningLoop ticks every second

Three diskengine goroutines drive lvol state. They all run on the storagenode and tick on a 1 Hz time.Ticker:

| Goroutine | File | What it does |
| --- | --- | --- |
| provisioningLoop | provisionlvol.go | Reads DB rows in CREATING state, calls bdev_lvol_create, attaches the lvol to an NVMe-oF subsystem, marks the DB row READY. |
| inPlaceResizeLoop | resize.go | Reads DB rows in RESIZING state, computes the target size, calls bdev_lvol_resize, zeros the new region. |
| snapshotDeleteLoop | snapshotdelete.go | Reads DB rows in DELETING state, calls bdev_lvol_delete, finalizes the row, re-syncs the lvstore's free-cluster count from SPDK. |

provisioningLoop also runs verifyState in the same tick. That means the create path is also the reconciliation path — every second, the system either makes progress or notices drift.

Trace 1: processProvisioning

The DB read. The query returns every lvol on this baremetal that's in CREATING state. The loop iterates them, calling provisionLvol on each.

Trace 1: provisionLvol

The per-lvol work. Six steps: validate placement, ensure NVMe-oF, build the params, call bdev_lvol_create, attach to the subsystem, finalize the DB row.
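The six steps can be sketched as a straight-line function that stops at the first failure. Everything here is illustrative: the `Lvol` fields, the `backend` interface, and its method names are assumptions made for the example, and the `fake` type exists only so the sketch runs without SPDK or a database.

```go
package main

import (
	"errors"
	"fmt"
)

// Lvol is a hypothetical stand-in for the DB row.
type Lvol struct {
	ID          int
	SizeBytes   uint64
	LvstoreUUID string
	State       string // "CREATING" or "READY"
}

// backend abstracts the calls the steps make (names are assumptions).
type backend interface {
	ValidatePlacement(l *Lvol) error
	EnsureNvmeofReady(l *Lvol) error
	CreateLvol(lvstoreUUID, name string, sizeMiB uint64) (string, error)
	AddNamespace(bdevUUID string) error
	Finalize(l *Lvol, bdevUUID string) error
}

// provisionLvol walks the six steps in order. Any error returns early and
// leaves the row in CREATING, so the next tick retries it.
func provisionLvol(b backend, l *Lvol) error {
	if err := b.ValidatePlacement(l); err != nil { // 1. validate placement
		return fmt.Errorf("validate placement: %w", err)
	}
	if err := b.EnsureNvmeofReady(l); err != nil { // 2. ensure NVMe-oF
		return fmt.Errorf("ensure nvmeof: %w", err)
	}
	name, sizeMiB := fmt.Sprint(l.ID), l.SizeBytes>>20 // 3. build the params
	uuid, err := b.CreateLvol(l.LvstoreUUID, name, sizeMiB) // 4. bdev_lvol_create
	if err != nil {
		return fmt.Errorf("create lvol: %w", err)
	}
	if err := b.AddNamespace(uuid); err != nil { // 5. attach to the subsystem
		return fmt.Errorf("add namespace: %w", err)
	}
	return b.Finalize(l, uuid) // 6. finalize the DB row
}

// fake records calls and can be told to fail at one named step.
type fake struct {
	calls  []string
	failAt string
}

func (f *fake) do(name string) error {
	f.calls = append(f.calls, name)
	if name == f.failAt {
		return errors.New(name + " failed")
	}
	return nil
}
func (f *fake) ValidatePlacement(*Lvol) error { return f.do("validate") }
func (f *fake) EnsureNvmeofReady(*Lvol) error { return f.do("nvmeof") }
func (f *fake) CreateLvol(_, _ string, _ uint64) (string, error) {
	return "bdev-uuid", f.do("create")
}
func (f *fake) AddNamespace(string) error { return f.do("add_ns") }
func (f *fake) Finalize(l *Lvol, _ string) error {
	if err := f.do("finalize"); err != nil {
		return err
	}
	l.State = "READY"
	return nil
}

func main() {
	l := &Lvol{ID: 42, SizeBytes: 1 << 30, LvstoreUUID: "lvs", State: "CREATING"}
	f := &fake{}
	err := provisionLvol(f, l)
	fmt.Println(f.calls, l.State, err)
}
```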

Trace 1: ensureNvmeofReady

Before creating the lvol, the NVMe-oF transport, subsystem, and listener must exist. The check is "create-if-missing": idempotent and safe to call on every tick. The function starts at ensureNvmeofReady:13.

The subsystem check (lines 55-93) does the same dance for the NQN: query, find or create, then refresh and locate. The listener check (lines 94-118) iterates the list of IPs the disk's RDMA interface is bound to and creates a listener on each. The full function makes 1-3 RPCs in the common case, more if multiple IPs are configured.

Trace 1: BdevLvolCreate — Go to C

The Go wrapper is at BdevLvolCreate:42. The full trace of how the call lands in the C handler is in 3.2, bdev_lvol_create end-to-end. The short version:

  1. Go client encodes: json.Marshal of BdevLvolCreateParams
  2. Unix socket write: one JSON object + newline
  3. SPDK recv: spdk_jsonrpc_server_poll pulls the bytes
  4. JSON parse: two-pass (dry run + real decode)
  5. Method lookup: linear scan of g_rpc_methods
  6. Handler dispatch: rpc_bdev_lvol_create runs

The Go-side payload is the struct from BdevLvolCreateParams:90:

{
  "lvol_name":      "42",
  "size_in_mib":    1024,
  "thin_provision": true,
  "clear_method":   "unmap",
  "uuid":           "bd56a4e6-..."
}

The "uuid" field is the lvstore's UUID — diskengine always addresses the lvstore by UUID, never by name. The "lvol_name" is the integer lvol ID rendered as a string. The "size_in_mib" is the rounded-down MiB count (any bytes past the cluster boundary are silently dropped).
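A minimal sketch of how such a request could be encoded, assuming a JSON-RPC 2.0 envelope (the framing SPDK's RPC server uses). Only the JSON keys come from the payload above; the Go field names, `rpcRequest`, `encodeLvolCreate`, the request ID, and the all-zero lvstore UUID are placeholders for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// BdevLvolCreateParams mirrors the payload shown above.
// The Go field names are assumptions; the JSON keys are from the text.
type BdevLvolCreateParams struct {
	LvolName      string `json:"lvol_name"`
	SizeInMiB     uint64 `json:"size_in_mib"`
	ThinProvision bool   `json:"thin_provision"`
	ClearMethod   string `json:"clear_method"`
	UUID          string `json:"uuid"` // the lvstore's UUID, not the lvol's
}

// rpcRequest is a JSON-RPC 2.0 envelope (hypothetical type name).
type rpcRequest struct {
	Version string      `json:"jsonrpc"`
	ID      int         `json:"id"`
	Method  string      `json:"method"`
	Params  interface{} `json:"params"`
}

// encodeLvolCreate returns the newline-terminated bytes for one request.
func encodeLvolCreate(reqID, lvolID int, sizeBytes uint64, lvstoreUUID string) ([]byte, error) {
	params := BdevLvolCreateParams{
		LvolName:      strconv.Itoa(lvolID), // integer ID rendered as a string
		SizeInMiB:     sizeBytes / (1 << 20), // integer division rounds down
		ThinProvision: true,
		ClearMethod:   "unmap",
		UUID:          lvstoreUUID,
	}
	b, err := json.Marshal(rpcRequest{
		Version: "2.0",
		ID:      reqID,
		Method:  "bdev_lvol_create",
		Params:  params,
	})
	if err != nil {
		return nil, err
	}
	return append(b, '\n'), nil // one JSON object plus a newline
}

func main() {
	// 1 GiB plus 4096 stray bytes: the remainder is dropped by the division.
	req, err := encodeLvolCreate(1, 42, 1<<30+4096, "00000000-0000-0000-0000-000000000000")
	if err != nil {
		panic(err)
	}
	fmt.Print(string(req))
}
```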

Trace 1: rpc_bdev_lvol_create — the C handler

The C handler is at module/bdev/lvol/vbdev_lvol_rpc.c:332 . The body decodes, looks up, translates, dispatches, and queues the response:

Trace 1: vbdev_lvol_create — into the lvol layer

The bridge from the bdev module to the lvol subsystem is at module/bdev/lvol/vbdev_lvol.c:1232 . It allocates a request wrapper and calls the lvol library:

Trace 1: the cluster allocation and metadata write

spdk_lvol_create ( lib/lvol/lvol.c:1173 ) sets up the blob options:

The actual cluster allocation and metadata write happens inside spdk_bs_create_blob_ext. For a thick lvol, this is a synchronous-feeling sequence of:

  1. Allocate a blob ID (find a clear bit in used_blobids).
  2. For each cluster in num_clusters: find a free cluster in used_clusters, then write the new extent table entry.
  3. Persist the metadata pages (one md_page write per page that's been dirtied).
  4. Mark the lvstore clean (if all writes succeeded).

For a thin lvol, step 2 is skipped — the extent table is just zeros — but the blob ID allocation and the metadata write still happen. The "thick vs. thin" distinction is purely about whether num_clusters worth of free clusters were taken from the pool.
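The thick/thin difference can be illustrated with a toy model of the bookkeeping. This is not SPDK's implementation: `store`, `firstClear`, and `createBlob` are hypothetical, plain bool slices replace the real bit arrays, metadata persistence is omitted, and the toy reserves cluster 0 so that a zero extent entry can mean "unallocated".

```go
package main

import (
	"errors"
	"fmt"
)

// store is a toy model: one bool per blob ID and one per cluster.
type store struct {
	usedBlobIDs  []bool
	usedClusters []bool
}

func newStore(blobIDs, clusters int) *store {
	s := &store{make([]bool, blobIDs), make([]bool, clusters)}
	s.usedClusters[0] = true // reserved, so extent value 0 means "unallocated"
	return s
}

// firstClear returns the index of the first unset entry, or -1 if none.
func firstClear(bits []bool) int {
	for i, used := range bits {
		if !used {
			return i
		}
	}
	return -1
}

func (s *store) freeClusters() int {
	n := 0
	for _, used := range s.usedClusters {
		if !used {
			n++
		}
	}
	return n
}

// createBlob allocates a blob ID and returns it with the extent table.
// Thick blobs claim numClusters clusters; thin blobs claim none.
func (s *store) createBlob(numClusters int, thin bool) (int, []int, error) {
	id := firstClear(s.usedBlobIDs)
	if id < 0 {
		return 0, nil, errors.New("no free blob ID")
	}
	if !thin && s.freeClusters() < numClusters {
		return 0, nil, errors.New("not enough free clusters")
	}
	extents := make([]int, numClusters) // all zero: nothing allocated yet
	if !thin {
		for i := range extents {
			c := firstClear(s.usedClusters)
			s.usedClusters[c] = true
			extents[i] = c
		}
	}
	s.usedBlobIDs[id] = true
	return id, extents, nil
}

func main() {
	s := newStore(4, 9)
	id, ext, _ := s.createBlob(3, false)
	fmt.Println("thick:", id, ext, "free:", s.freeClusters())
	id, ext, _ = s.createBlob(3, true)
	fmt.Println("thin: ", id, ext, "free:", s.freeClusters())
}
```

In the toy, both calls consume a blob ID, but only the thick one lowers the free-cluster count.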

Once the blob is created, the completion callback lvol_create_cb ( lib/lvol/lvol.c:1050 ) opens the blob, attaches it to the lvol struct, and returns control up the chain. The bdev layer's _vbdev_lvol_create_cb ( module/bdev/lvol/vbdev_lvol.c:1216 ) then calls _create_lvol_disk (line 1131), which builds the spdk_bdev and registers it via spdk_bdev_register. The bdev's name is the lvol UUID; the alias is "<lvstore_name>/<lvol_name>".

Once the bdev is registered, the bdev layer's _vbdev_lvol_create_cb invokes the original RPC callback rpc_bdev_lvol_create_cb ( module/bdev/lvol/vbdev_lvol_rpc.c:312 ). That callback writes the response:

Trace 1: NvmfSubsystemAddNs — the response path

Back in Go, the next call is NvmfSubsystemAddNs (NvmfSubsystemAddNs:114). It attaches the lvol bdev to the NVMe-oF subsystem that ensureNvmeofReady set up. The handler lives in the nvmf RPC layer; the lvol layer is not involved.

The final step is repository.FinalizeProvisioningForLvol, which updates the DB row from CREATING to READY. After this, the next tick's processProvisioning query won't return this lvol. The trace is done.

Trace 1: verifyState does a parallel check

On the same tick (back in provisioningLoop), verifyState runs. It cross-references the DB against SPDK's view of the world:
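A minimal sketch of such a cross-reference, assuming the DB view is a map from lvol UUID to state and the SPDK view is a list of lvol UUIDs. `drift` and `diffState` are hypothetical names; the rule that only READY rows are checked follows the GetReadyLvols behaviour described under edge cases.

```go
package main

import (
	"fmt"
	"sort"
)

// drift lists the lvol UUIDs on which the two views disagree.
type drift struct {
	MissingInSPDK []string // READY in the DB, absent from SPDK
	UnknownToDB   []string // present in SPDK, no DB row at all
}

// diffState compares DB rows (uuid -> state) against SPDK's lvol list.
// Rows that are not READY are never reported as missing, so an lvol that is
// still being created cannot produce a false report.
func diffState(db map[string]string, spdk []string) drift {
	inSPDK := make(map[string]bool, len(spdk))
	for _, u := range spdk {
		inSPDK[u] = true
	}
	var d drift
	for uuid, state := range db {
		if state == "READY" && !inSPDK[uuid] {
			d.MissingInSPDK = append(d.MissingInSPDK, uuid)
		}
	}
	for _, u := range spdk {
		if _, ok := db[u]; !ok {
			d.UnknownToDB = append(d.UnknownToDB, u)
		}
	}
	// Map iteration order is random; sort for stable reports.
	sort.Strings(d.MissingInSPDK)
	sort.Strings(d.UnknownToDB)
	return d
}

func main() {
	db := map[string]string{"a": "READY", "b": "READY", "c": "CREATING"}
	spdk := []string{"a", "c", "x"}
	fmt.Printf("%+v\n", diffState(db, spdk))
}
```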

Trace 2: processInPlaceResizes

The resize loop. A separate goroutine on a 1 Hz tick. The DB returns lvols in RESIZING state with a new capacity. The loop computes the target size, issues bdev_lvol_resize, and (importantly) zeros the delta region.

The C-side handler is rpc_bdev_lvol_resize ( module/bdev/lvol/vbdev_lvol_rpc.c:846 ). The chain is vbdev_lvol_resize → spdk_lvol_resize → spdk_blob_resize. The blobstore allocates the new clusters (or extends the extent table for thin lvols), updates the blob's metadata, and on completion notifies the bdev layer via spdk_bdev_notify_blockcnt_change ( module/bdev/lvol/vbdev_lvol.c:1406 ). The framework then re-evaluates the bdev's blockcnt and notifies any desc holders.
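On the Go side, the decision the loop makes before issuing the RPC can be sketched as a pure function. `resizePlan` and `planResize` are hypothetical names, and the sketch works in bytes only; conversion to the RPC's size unit is left out.

```go
package main

import "fmt"

// resizePlan says whether to issue bdev_lvol_resize and which byte range
// (the delta region) has to be zeroed afterwards.
type resizePlan struct {
	NeedResize bool
	ZeroOffset uint64 // start of the delta region: the old size
	ZeroLength uint64 // length of the delta region: new size minus old size
}

// planResize compares SPDK's current size with the requested size.
// If the lvol is already at or above the requested size, nothing is planned.
func planResize(curBytes, wantBytes uint64) resizePlan {
	if curBytes >= wantBytes {
		return resizePlan{}
	}
	return resizePlan{
		NeedResize: true,
		ZeroOffset: curBytes,
		ZeroLength: wantBytes - curBytes,
	}
}

func main() {
	fmt.Printf("%+v\n", planResize(1<<30, 3<<30)) // grow from 1 GiB to 3 GiB
	fmt.Printf("%+v\n", planResize(3<<30, 3<<30)) // already at target size
}
```

Only the range between the old and the new size is zeroed; existing data below the old size is left untouched.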

Trace 3: processDeletingSnapshots

The snapshot-delete loop. A separate goroutine on a 1 Hz tick. The DB returns snapshots in DELETING state on this node.

The C-side handler is rpc_bdev_lvol_delete ( module/bdev/lvol/vbdev_lvol_rpc.c:995 ), which calls vbdev_lvol_destroy ( module/bdev/lvol/vbdev_lvol.c:690 ). That function calls _vbdev_lvol_destroy (line 650), which checks for clones and then calls spdk_bdev_unregister. The bdev unregister fires vbdev_lvol_unregister (line 615), which calls spdk_lvol_close. The lvol close path then triggers the blob's destroy, which releases the clusters back to the pool and updates FreeClusters.

The free-cluster re-sync is the part diskengine cares about: after the delete, the lvstore's free pool is bigger by however many clusters the snapshot owned. The spdkFree calculation at line 95 is the same FreeClusters * ClusterSize formula we saw in the lvstore-metadata page, just applied here to a post-delete state.
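The formula is small enough to show in full. In this sketch the JSON key names follow SPDK's bdev_lvol_get_lvstores output; the Go type, the function names, and the sample numbers are assumptions for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// lvstoreInfo holds the fields of an lvstore report that the formula needs.
type lvstoreInfo struct {
	UUID         string `json:"uuid"`
	FreeClusters uint64 `json:"free_clusters"`
	ClusterSize  uint64 `json:"cluster_size"`
}

// spdkFree is FreeClusters * ClusterSize, in bytes.
func spdkFree(i lvstoreInfo) uint64 {
	return i.FreeClusters * i.ClusterSize
}

// parseLvstores decodes the result array of a lvstore listing.
func parseLvstores(raw []byte) ([]lvstoreInfo, error) {
	var out []lvstoreInfo
	if err := json.Unmarshal(raw, &out); err != nil {
		return nil, err
	}
	return out, nil
}

func main() {
	// Made-up sample: 4 MiB clusters, before and after deleting a snapshot
	// that owned 25 clusters.
	before := []byte(`[{"uuid":"lvs-1","free_clusters":100,"cluster_size":4194304}]`)
	after := []byte(`[{"uuid":"lvs-1","free_clusters":125,"cluster_size":4194304}]`)
	for _, raw := range [][]byte{before, after} {
		stores, err := parseLvstores(raw)
		if err != nil {
			panic(err)
		}
		fmt.Println(stores[0].UUID, "free bytes:", spdkFree(stores[0]))
	}
}
```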

Edge cases & what trips people up

"Already exists" recovery: this is the path you want to hit

The isAlreadyExistsErr check (isAlreadyExistsErr:101) exists for a reason: SPDK may have created the lvol, the Go process may have crashed before marking the DB row READY, and the next tick would otherwise try to re-create it. The recovery flow (findExistingLvolUUID:103) queries SPDK for the lvol by name and uses its UUID. That this sits in the hot path is a feature, not a bug: it makes the loop crash-safe.
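The recovery flow can be sketched as "create, and adopt on conflict". The function names isAlreadyExistsErr and findExistingLvolUUID are taken from the text, but their bodies here are assumptions: the real check may inspect an RPC error code or message rather than a sentinel error, and `fakeSPDK` and `createOrAdopt` are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// errAlreadyExists is a sentinel used only by this sketch.
var errAlreadyExists = errors.New("already exists")

func isAlreadyExistsErr(err error) bool {
	return errors.Is(err, errAlreadyExists)
}

// fakeSPDK is an in-memory stand-in that rejects duplicate lvol names.
type fakeSPDK struct {
	byName map[string]string // lvol name -> lvol UUID
	next   int
}

func (f *fakeSPDK) create(name string) (string, error) {
	if _, ok := f.byName[name]; ok {
		return "", fmt.Errorf("bdev_lvol_create %q: %w", name, errAlreadyExists)
	}
	f.next++
	uuid := fmt.Sprintf("uuid-%d", f.next)
	f.byName[name] = uuid
	return uuid, nil
}

func (f *fakeSPDK) findExistingLvolUUID(name string) (string, bool) {
	uuid, ok := f.byName[name]
	return uuid, ok
}

// createOrAdopt creates the lvol, or adopts the existing one when a previous
// attempt already created it (for example before a crash).
func createOrAdopt(f *fakeSPDK, name string) (string, error) {
	uuid, err := f.create(name)
	if err == nil {
		return uuid, nil
	}
	if !isAlreadyExistsErr(err) {
		return "", err
	}
	if existing, ok := f.findExistingLvolUUID(name); ok {
		return existing, nil
	}
	return "", fmt.Errorf("lvol %q exists but could not be found: %w", name, err)
}

func main() {
	spdk := &fakeSPDK{byName: map[string]string{}}
	first, _ := createOrAdopt(spdk, "42")  // creates the lvol
	second, _ := createOrAdopt(spdk, "42") // retry after a simulated crash
	fmt.Println(first, second, len(spdk.byName))
}
```

Because the retry returns the same UUID as the first attempt, the caller can continue with namespace attach and finalization as if the create had just succeeded.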

Lvstore not loaded: defer the snapshot delete

If the lvol's lvstore isn't loaded in SPDK yet (e.g. after an SPDK restart that hasn't finished examining all bdevs), the snapshot delete would say "not found" but the snapshot might still exist on disk. The loadedLvstores set ( loadedLvstores:49 ) prevents the diskengine from finalizing the deletion in that case. The delete retries on the next tick.
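The gating logic can be sketched as a three-way decision. The `action` type and `nextAction` are hypothetical, and treating "lvstore loaded, snapshot not found" as safe to finalize is an inference from the paragraph above rather than confirmed behaviour.

```go
package main

import "fmt"

// action is what the loop does with one DELETING snapshot on this tick.
type action string

const (
	deferDelete action = "defer"    // lvstore not loaded: retry next tick
	issueDelete action = "delete"   // snapshot is visible: delete it
	finalizeRow action = "finalize" // lvstore loaded, snapshot gone: finalize
)

// nextAction trusts a "not found" answer only when the snapshot's lvstore is
// in the loaded set. Otherwise the snapshot may still exist on disk.
func nextAction(loadedLvstores map[string]struct{}, lvstoreUUID string, foundInSPDK bool) action {
	if _, loaded := loadedLvstores[lvstoreUUID]; !loaded {
		return deferDelete
	}
	if foundInSPDK {
		return issueDelete
	}
	return finalizeRow
}

func main() {
	loaded := map[string]struct{}{"lvs-a": {}}
	fmt.Println(nextAction(loaded, "lvs-a", true))  // visible snapshot
	fmt.Println(nextAction(loaded, "lvs-a", false)) // already gone
	fmt.Println(nextAction(loaded, "lvs-b", false)) // lvstore still loading
}
```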

Concurrent provision + verify on the same tick

provisioningLoop calls processProvisioning and verifyState in the same tick. verifyState therefore runs just after the create, so it can see a newly created lvol in SPDK whose DB row is still CREATING. The verifier skips CREATING rows and checks only READY lvols (see GetReadyLvols:24), so a create in progress doesn't trigger a false drift report. The create-then-finalize window is one function call, not a tick.

Resize with an in-progress NBD zero

If the resize succeeds but the NBD-based zeroing fails (e.g. dd is killed), the lvol is in a halfway state: it has the new size, but the delta region is uninitialized. The processInPlaceResizes loop returns without calling repository.MarkLvolUpByUUID, so the DB row stays in RESIZING state and the resize retries on the next tick. The next tick reads the current SPDK size (which already includes the delta), and the "already at or above requested size" check at cur.bytes >= want:81 makes it skip the resize, but the zero is still missing. The cleanest fix is to record in the DB that the resize succeeded but the zero didn't, and re-zero on the next tick. Diskengine currently doesn't track this state separately; it logs the failure and hopes the next restart catches it via the verification loop.
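One way the suggested fix could look, as a sketch only: the row gains a hypothetical ZeroPending flag and ZeroFrom offset that are written before zeroing starts, so a later tick can re-zero even though the size check would skip the resize. None of these names exist in diskengine, and the final "READY" state stands in for whatever repository.MarkLvolUpByUUID sets.

```go
package main

import (
	"errors"
	"fmt"
)

var errZeroFailed = errors.New("zeroing failed")

// row is a hypothetical DB row. ZeroFrom and ZeroPending are the proposed
// additions; they would be persisted alongside the state.
type row struct {
	State       string
	WantBytes   uint64
	ZeroFrom    uint64
	ZeroPending bool
}

// resizeTick performs one tick of work for a RESIZING row.
func resizeTick(r *row, curBytes uint64, resize func(newBytes uint64) error, zero func(off, n uint64) error) error {
	if !r.ZeroPending && curBytes < r.WantBytes {
		if err := resize(r.WantBytes); err != nil {
			return err
		}
		// Record the delta region before zeroing, so a failed zero is
		// remembered across ticks.
		r.ZeroFrom, r.ZeroPending = curBytes, true
	}
	if r.ZeroPending {
		if err := zero(r.ZeroFrom, r.WantBytes-r.ZeroFrom); err != nil {
			return err // row stays RESIZING with ZeroPending set
		}
		r.ZeroPending = false
	}
	r.State = "READY" // stand-in for repository.MarkLvolUpByUUID
	return nil
}

func main() {
	r := &row{State: "RESIZING", WantBytes: 2 << 30}
	resize := func(n uint64) error { fmt.Println("resize to", n); return nil }
	attempts := 0
	zero := func(off, n uint64) error {
		attempts++
		if attempts == 1 {
			return errZeroFailed // simulate dd being killed on the first try
		}
		fmt.Println("zeroed", n, "bytes at offset", off)
		return nil
	}
	// Tick 1 sees the old size; tick 2 sees the already-grown lvol.
	for tick, cur := range []uint64{1 << 30, 2 << 30} {
		err := resizeTick(r, cur, resize, zero)
		fmt.Println("tick", tick+1, "err:", err, "state:", r.State)
	}
}
```

The design choice is to key the retry on the recorded flag rather than on SPDK's reported size, since the size alone cannot distinguish "resized and zeroed" from "resized only".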

RPC failure mid-create: the lvol is orphaned

If bdev_lvol_create succeeds but the subsequent NvmfSubsystemAddNs fails for a reason other than "already exists", the lvol is created in SPDK but not exposed via NVMe-oF. The log at orphan bdev warning:132 is the operator's hint. Cleanup is manual (bdev_lvol_delete by hand); verifyState will also report the orphan on the next tick.

What to take away

Three goroutines, three 1 Hz loops, three end-to-end traces. The create path goes Go → JSON-RPC → rpc_bdev_lvol_create → vbdev_lvol_create → spdk_lvol_create → blobstore. The resize path adds a zeroing step (via NBD + dd) that is the slowest part of the operation. The delete path re-syncs the lvstore's free-cluster count from SPDK after the lvol is destroyed. verifyState runs on the same tick as the create path and is the safety net for all three: every discrepancy between DB and SPDK is reported, and the capacity / free-bytes counters are updated to match SPDK's view. The whole subsystem is designed as an eventually consistent system on a 1-second tick that converges in the face of crashes and partial failures.

You're now at the end of Layer 5. The next layer is 6.1 — NVMe-oF concepts, which is about how those lvols get exposed over the network.