What diskengine actually does when it touches an lvol.
This is the marquee page of Layer 5. We trace three
diskengine loops that touch lvols, end to end, from the
outermost Go tick to the innermost C function. The loops
are: provisionLvol (creating a new lvol),
processInPlaceResizes (growing an existing
lvol), and processDeletingSnapshots (deleting
a snapshot and reconciling the lvstore's free-cluster
count). At the end of each trace you'll know exactly which
bytes move on the wire, which C function runs, and which
DB write finalizes the operation.
The orchestrator: provisioningLoop ticks every second
Three diskengine goroutines drive lvol state. They all run
on the storagenode and tick on a 1 Hz time.Ticker:
| Goroutine | File | What it does |
| --- | --- | --- |
| provisioningLoop | provisionlvol.go | Reads DB rows in CREATING state, calls bdev_lvol_create, attaches the lvol to an NVMe-oF subsystem, marks the DB row READY. |
| inPlaceResizeLoop | resize.go | Reads DB rows in RESIZING state, computes the target size, calls bdev_lvol_resize, zeros the new region. |
| snapshotDeleteLoop | snapshotdelete.go | Reads DB rows in DELETING state, calls bdev_lvol_delete, finalizes the row, re-syncs the lvstore's free-cluster count from SPDK. |
provisioningLoop also runs
verifyState in the same tick. That means the
create path is also the
reconciliation path — every second, the system
either makes progress or notices drift.
Trace 1: processProvisioning
The DB read. The query returns every lvol on this baremetal
that's in CREATING state. The loop iterates
them, calling provisionLvol on each.
Trace 1: provisionLvol
The per-lvol work. Six steps: validate placement, ensure
NVMe-oF, build the params, call bdev_lvol_create,
attach to the subsystem, finalize the DB row.
Trace 1: ensureNvmeofReady
Before creating the lvol, the NVMe-oF transport, subsystem,
and listener must exist. The check is "create-if-missing"
— idempotent, safe to call on every tick. The function is
in
ensureNvmeofReady:13.
The subsystem check (lines 55-93) does the same dance for
the NQN: query, find or create, then refresh and locate.
The listener check (lines 94-118) iterates the list of
IPs the disk's RDMA interface is bound to and creates a
listener on each. The full function makes 1-3 RPCs in the
common case, more if multiple IPs are configured.
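The create-if-missing pattern is easiest to see against a fake client. Everything in this sketch is hypothetical (the `nvmfClient` interface, its method names, and the NQN); it only illustrates why calling the check on every tick is safe. The "refresh and locate" step mentioned above is omitted.

```go
package main

import "fmt"

// nvmfClient is a hypothetical interface; the real RPC wrapper differs.
type nvmfClient interface {
	ListSubsystems() ([]string, error)
	CreateSubsystem(nqn string) error
}

// ensureSubsystem issues a create only when the NQN is absent.
// It reports whether a create RPC was sent.
func ensureSubsystem(c nvmfClient, nqn string) (bool, error) {
	existing, err := c.ListSubsystems()
	if err != nil {
		return false, err
	}
	for _, n := range existing {
		if n == nqn {
			return false, nil // already there: nothing to do
		}
	}
	if err := c.CreateSubsystem(nqn); err != nil {
		return false, err
	}
	return true, nil
}

// fakeClient is an in-memory stand-in for SPDK.
type fakeClient struct {
	nqns    []string
	creates int
}

func (f *fakeClient) ListSubsystems() ([]string, error) { return f.nqns, nil }

func (f *fakeClient) CreateSubsystem(nqn string) error {
	f.nqns = append(f.nqns, nqn)
	f.creates++
	return nil
}

func main() {
	c := &fakeClient{}
	for tick := 0; tick < 3; tick++ {
		created, err := ensureSubsystem(c, "nqn.2024-01.example:disk0")
		fmt.Println("tick", tick, "created:", created, "err:", err)
	}
	fmt.Println("create RPCs sent:", c.creates)
}
```

Only the first tick sends a create; later ticks cost one list RPC each.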
Trace 1: BdevLvolCreate — Go to C
The Go wrapper is in
BdevLvolCreate:42.
The full trace of how the call lands in the C handler is in
3.2 — bdev_lvol_create
end-to-end. The short version:
STEP 01: Go client encodes (json.Marshal of BdevLvolCreateParams)
→ STEP 02: Unix socket write (one JSON object + newline)
→ STEP 03: SPDK recv (spdk_jsonrpc_server_poll pulls the bytes)
→ STEP 04: JSON parse (two-pass: dry run + real decode)
→ STEP 05: Method lookup (linear scan of g_rpc_methods)
→ STEP 06: Handler dispatch (rpc_bdev_lvol_create runs)
The Go-side payload is the struct from
BdevLvolCreateParams:90:
The "uuid" field is the lvstore's UUID — diskengine
always addresses the lvstore by UUID, never by name. The
"lvol_name" is the integer lvol ID rendered as a string.
The "size_in_mib" is the rounded-down MiB count (any bytes
past the cluster boundary are silently dropped).
Trace 1: rpc_bdev_lvol_create — the C handler
The C handler is at
module/bdev/lvol/vbdev_lvol_rpc.c:332.
The body decodes, looks up, translates, dispatches, and
queues the response:
Trace 1: vbdev_lvol_create — into the lvol layer
The bridge from the bdev module to the lvol subsystem is
at module/bdev/lvol/vbdev_lvol.c:1232.
It allocates a request wrapper and calls the lvol library:
Trace 1: the cluster allocation and metadata write
spdk_lvol_create
( lib/lvol/lvol.c:1173) sets up
the blob options:
The actual cluster allocation and metadata write happen
inside spdk_bs_create_blob_ext. For a thick
lvol, this is a synchronous-feeling sequence:
1. Allocate a blob ID (find a clear bit in used_blobids).
2. For each cluster in num_clusters: find a free cluster in used_clusters, then write the new extent table entry.
3. Persist the metadata pages (one md_page write per page that's been dirtied).
4. Mark the lvstore clean (if all writes succeeded).
For a thin lvol, step 2 is skipped — the extent table is
just zeros — but the blob ID allocation and the metadata
write still happen. The "thick vs. thin" distinction is
purely about whether num_clusters worth of
free clusters were taken from the pool.
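The thick/thin difference can be modelled with two bitmaps. This is a toy model written in Go for consistency with the rest of the page, not the blobstore's C implementation: the `store` type is hypothetical, the bitmaps are plain bool slices, and the metadata write is left out.

```go
package main

import (
	"errors"
	"fmt"
)

// store is a toy stand-in for the blobstore's two allocation bitmaps.
type store struct {
	usedBlobIDs  []bool
	usedClusters []bool
}

// firstClear returns the index of the first clear bit, or -1 if none.
func firstClear(bits []bool) int {
	for i, used := range bits {
		if !used {
			return i
		}
	}
	return -1
}

func (s *store) freeClusters() int {
	n := 0
	for _, used := range s.usedClusters {
		if !used {
			n++
		}
	}
	return n
}

// createBlob always claims a blob ID. Only a thick blob also claims
// numClusters clusters; a thin blob leaves the pool untouched.
func (s *store) createBlob(numClusters int, thin bool) (int, []int, error) {
	id := firstClear(s.usedBlobIDs)
	if id < 0 {
		return -1, nil, errors.New("no free blob ID")
	}
	var extents []int
	if !thin {
		if s.freeClusters() < numClusters {
			return -1, nil, errors.New("not enough free clusters")
		}
		for len(extents) < numClusters {
			c := firstClear(s.usedClusters)
			s.usedClusters[c] = true
			extents = append(extents, c)
		}
	}
	s.usedBlobIDs[id] = true
	return id, extents, nil
}

func main() {
	s := &store{usedBlobIDs: make([]bool, 4), usedClusters: make([]bool, 8)}
	_, _, _ = s.createBlob(4, true)
	fmt.Println("free after thin create:", s.freeClusters()) // 8
	_, _, _ = s.createBlob(4, false)
	fmt.Println("free after thick create:", s.freeClusters()) // 4
}
```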
Once the blob is created, the completion callback
lvol_create_cb
( lib/lvol/lvol.c:1050) opens the
blob, attaches it to the lvol struct, and returns control
up the chain. The bdev layer's
_vbdev_lvol_create_cb
( module/bdev/lvol/vbdev_lvol.c:1216)
then calls _create_lvol_disk
(line 1131), which builds the spdk_bdev and
registers it via spdk_bdev_register. The
bdev's name is the lvol UUID; the alias is
"<lvstore_name>/<lvol_name>".
Once the bdev is registered, the bdev layer's
_vbdev_lvol_create_cb invokes the original
RPC callback rpc_bdev_lvol_create_cb
( module/bdev/lvol/vbdev_lvol_rpc.c:312).
That callback writes the response:
Trace 1: NvmfSubsystemAddNs — the response path
Back in Go, once the create response arrives, the next call is at
NvmfSubsystemAddNs:114.
The call attaches the lvol bdev to the NVMe-oF subsystem
that was set up by ensureNvmeofReady. The
handler is in the nvmf RPC layer; the lvol layer is not
involved.
The final step is
repository.FinalizeProvisioningForLvol, which
updates the DB row from CREATING to
READY. After this, the next tick's
processProvisioning query won't return this
lvol. The trace is done.
Trace 1: verifyState does a parallel check
On the same tick (back in provisioningLoop),
verifyState runs. It cross-references the DB
against SPDK's view of the world:
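As a rough illustration of that cross-reference, the sketch below checks one direction only: READY lvols in the DB that SPDK does not know about. It is not the actual verifyState. The `missingFromSPDK` helper is hypothetical, and it relies on the fact stated earlier that an lvol bdev's name is the lvol UUID.

```go
package main

import "fmt"

// missingFromSPDK returns the UUIDs of READY lvols (from the DB) that have
// no bdev of the same name in SPDK. Input order is preserved.
func missingFromSPDK(dbReady, spdkBdevs []string) []string {
	present := make(map[string]bool, len(spdkBdevs))
	for _, name := range spdkBdevs {
		present[name] = true
	}
	var missing []string
	for _, uuid := range dbReady {
		if !present[uuid] {
			missing = append(missing, uuid)
		}
	}
	return missing
}

func main() {
	dbReady := []string{"uuid-a", "uuid-b"}
	// "uuid-creating" is in SPDK but its DB row is still CREATING,
	// so it is not part of the READY set and is not compared.
	spdk := []string{"uuid-a", "uuid-creating"}
	fmt.Println(missingFromSPDK(dbReady, spdk)) // [uuid-b]
}
```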
Trace 2: processInPlaceResizes
The resize loop. A separate goroutine on a 1 Hz tick. The
DB returns lvols in RESIZING state with a
new capacity. The loop computes the target size, issues
bdev_lvol_resize, and (importantly) zeros
the delta region.
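The target-size and delta computation can be sketched as below. The `resizePlan` type and `planResize` function are hypothetical names; the only behavior taken from this page is that a resize is skipped when the current size is already at or above the requested one, and that the region to zero is the newly added tail.

```go
package main

import "fmt"

// resizePlan describes what one resize tick needs to do.
type resizePlan struct {
	skipResize bool   // SPDK is already at or above the requested size
	zeroOffset uint64 // first byte of the new region
	zeroLength uint64 // number of bytes to zero
}

// planResize compares the current size with the requested size.
func planResize(curBytes, wantBytes uint64) resizePlan {
	if curBytes >= wantBytes {
		return resizePlan{skipResize: true}
	}
	return resizePlan{zeroOffset: curBytes, zeroLength: wantBytes - curBytes}
}

func main() {
	const gib = 1 << 30
	p := planResize(10*gib, 16*gib)
	fmt.Println("skip:", p.skipResize, "zero from:", p.zeroOffset, "length:", p.zeroLength)
	fmt.Println("skip:", planResize(16*gib, 16*gib).skipResize)
}
```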
The C-side handler is
rpc_bdev_lvol_resize
( module/bdev/lvol/vbdev_lvol_rpc.c:846).
The chain is
vbdev_lvol_resize → spdk_lvol_resize
→ spdk_blob_resize. The blobstore allocates
the new clusters (or extends the extent table for thin
lvols), updates the blob's metadata, and on completion
notifies the bdev layer via
spdk_bdev_notify_blockcnt_change
( module/bdev/lvol/vbdev_lvol.c:1406).
The framework then re-evaluates the bdev's blockcnt and
notifies any desc holders.
Trace 3: processDeletingSnapshots
The snapshot-delete loop. A separate goroutine on a 1 Hz
tick. The DB returns snapshots in DELETING
state on this node.
The C-side handler is
rpc_bdev_lvol_delete
( module/bdev/lvol/vbdev_lvol_rpc.c:995),
which calls
vbdev_lvol_destroy
( module/bdev/lvol/vbdev_lvol.c:690).
That function calls
_vbdev_lvol_destroy
(line 650), which checks for clones and then calls
spdk_bdev_unregister. The bdev unregister
fires vbdev_lvol_unregister
(line 615), which calls
spdk_lvol_close. The lvol close path then
triggers the blob's destroy, which releases the clusters
back to the pool and updates FreeClusters.
The free-cluster re-sync is the part diskengine cares
about: after the delete, the lvstore's free pool is bigger
by however many clusters the snapshot owned. The
spdkFree calculation at line 95 is the same
FreeClusters * ClusterSize formula we saw in
the lvstore-metadata page, just applied here to a
post-delete state.
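The formula itself is a single multiplication. In this sketch the `lvstoreInfo` struct is a hypothetical container for the two values named above, and the numbers are made up for illustration.

```go
package main

import "fmt"

// lvstoreInfo holds the two lvstore values the formula needs.
type lvstoreInfo struct {
	FreeClusters uint64
	ClusterSize  uint64 // bytes
}

// spdkFreeBytes applies FreeClusters * ClusterSize.
func spdkFreeBytes(l lvstoreInfo) uint64 {
	return l.FreeClusters * l.ClusterSize
}

func main() {
	const clusterSize = 4 << 20 // 4 MiB, an illustrative value
	before := lvstoreInfo{FreeClusters: 100, ClusterSize: clusterSize}
	// Suppose the deleted snapshot owned 25 clusters.
	after := lvstoreInfo{FreeClusters: 125, ClusterSize: clusterSize}
	fmt.Println("free before:", spdkFreeBytes(before))
	fmt.Println("free after: ", spdkFreeBytes(after))
	fmt.Println("reclaimed:  ", spdkFreeBytes(after)-spdkFreeBytes(before))
}
```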
Edge cases & what trips people up
"Already exists" recovery: this is the path you want to hit
The
isAlreadyExistsErr check in
isAlreadyExistsErr:101
exists for a reason: SPDK may have created the lvol, the
Go process may have crashed before marking the DB row
READY, and the next tick would otherwise
re-create. The recovery flow
( findExistingLvolUUID:103)
queries SPDK for the lvol by name and uses its UUID. The
fact that this is in the hot path is a feature, not a bug
— it makes the loop crash-safe.
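A minimal sketch of the recovery flow, under stated assumptions: the `lvolClient` interface, the `errAlreadyExists` sentinel, and `createOrAdopt` are all hypothetical, and how the real isAlreadyExistsErr recognizes the error is not shown on this page.

```go
package main

import (
	"errors"
	"fmt"
)

// errAlreadyExists is a hypothetical sentinel for the "already exists" case.
var errAlreadyExists = errors.New("lvol already exists")

// lvolClient is a hypothetical view of the two RPCs involved.
type lvolClient interface {
	Create(name string) (uuid string, err error)
	FindUUIDByName(name string) (string, error)
}

// createOrAdopt creates the lvol, or adopts the existing one if a previous
// attempt got as far as SPDK but never reached the DB write.
func createOrAdopt(c lvolClient, name string) (string, error) {
	uuid, err := c.Create(name)
	if err == nil {
		return uuid, nil
	}
	if !errors.Is(err, errAlreadyExists) {
		return "", err
	}
	return c.FindUUIDByName(name)
}

// fakeSPDK is an in-memory stand-in that remembers created lvols.
type fakeSPDK struct {
	byName map[string]string
	next   int
}

func (f *fakeSPDK) Create(name string) (string, error) {
	if _, ok := f.byName[name]; ok {
		return "", fmt.Errorf("create %q: %w", name, errAlreadyExists)
	}
	f.next++
	uuid := fmt.Sprintf("uuid-%d", f.next)
	f.byName[name] = uuid
	return uuid, nil
}

func (f *fakeSPDK) FindUUIDByName(name string) (string, error) {
	uuid, ok := f.byName[name]
	if !ok {
		return "", errors.New("not found")
	}
	return uuid, nil
}

func main() {
	spdk := &fakeSPDK{byName: map[string]string{}}
	first, _ := createOrAdopt(spdk, "42")  // normal create
	second, _ := createOrAdopt(spdk, "42") // simulated retry after a crash
	fmt.Println(first, second, first == second)
}
```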
Lvstore not loaded: defer the snapshot delete
If the lvol's lvstore isn't loaded in SPDK yet (e.g. after
an SPDK restart that hasn't finished examining all
bdevs), the snapshot delete would say "not found" but
the snapshot might still exist on disk. The
loadedLvstores set
( loadedLvstores:49)
prevents the diskengine from finalizing the deletion in
that case. The delete retries on the next tick.
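The gating decision can be written as a small pure function. This is a sketch of the reasoning above, not the real code: `decideDelete` and its string results are hypothetical, and the set is modelled as a `map[string]bool` keyed by lvstore UUID.

```go
package main

import "fmt"

// decideDelete returns what one tick should do for a DELETING snapshot:
//   "defer"    - lvstore not loaded yet, so "not found" cannot be trusted
//   "delete"   - snapshot is visible in SPDK, issue the delete
//   "finalize" - lvstore is loaded and the snapshot is gone, close the row
func decideDelete(loaded map[string]bool, lvstoreUUID string, foundInSPDK bool) string {
	switch {
	case !loaded[lvstoreUUID]:
		return "defer"
	case foundInSPDK:
		return "delete"
	default:
		return "finalize"
	}
}

func main() {
	loaded := map[string]bool{"lvs-1": true}
	fmt.Println(decideDelete(loaded, "lvs-2", false)) // defer
	fmt.Println(decideDelete(loaded, "lvs-1", true))  // delete
	fmt.Println(decideDelete(loaded, "lvs-1", false)) // finalize
}
```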
Concurrent provision + verify on the same tick
provisioningLoop calls
processProvisioning and
verifyState in the same tick. The verify
pass runs immediately after the create, so it sees the
newly created lvol in SPDK and the still-CREATING
row in the DB. The verifier doesn't check the
CREATING state — it only checks
READY lvols (see
GetReadyLvols:24).
So a create-in-progress doesn't trigger a false drift
report. The create-then-finalize window is one function
call, not a tick.
Resize with an in-progress NBD zero
If the resize succeeds but the NBD-based zeroing fails
(e.g. dd is killed), the lvol is in a
halfway state: it has the new size, but the delta region
is uninitialized. The
processInPlaceResizes loop returns
without calling
repository.MarkLvolUpByUUID, so the DB row
stays in RESIZING state and the resize
retries on the next tick. The next tick reads the
current SPDK size (which already includes the
delta), and the "already at or above requested size" check
at
cur.bytes >= want:81
makes it skip the resize — but the zero is still missing.
The cleanest fix is to record in the DB that the resize
succeeded but the zero didn't, and re-zero on the next
tick. Diskengine currently doesn't track this state
separately; it logs the failure and hopes the next
restart catches it via the verification loop.
RPC failure mid-create: the lvol is orphaned
If bdev_lvol_create succeeds but the
subsequent NvmfSubsystemAddNs fails for a
reason other than "already exists", the lvol is created
in SPDK but not exposed via NVMe-oF. The log at
orphan bdev warning:132
is the operator's hint. Cleanup is manual: delete the lvol
by hand with bdev_lvol_delete. verifyState will also
report the orphan on the next tick.
What to take away
Three goroutines, three 1 Hz loops, three end-to-end
traces. The create path goes Go → JSON-RPC →
rpc_bdev_lvol_create → vbdev_lvol_create
→ spdk_lvol_create → blobstore. The resize
path adds a zeroing step (via NBD + dd) that
is the slowest part of the operation. The delete path
re-syncs the lvstore's free-cluster count from SPDK
after the lvol is destroyed. verifyState
runs on the same tick as the create path and is the
safety net for all three: every discrepancy between DB
and SPDK is reported, and the capacity / free-bytes
counters are updated to match SPDK's view. The whole
subsystem is designed as a 1-second-tick, eventually
consistent system that converges in the face of crashes
and partial failures.
You're now at the end of Layer 5. The next layer is
6.1 — NVMe-oF concepts,
which is about how those lvols get exposed over the
network.