Layer 5 · lvol

The tree that lives on top of the cluster pool.

The lvstore is a cluster pool. It has clusters; some are allocated, some are free. The lvols are the consumers of the cluster pool, but they're more than that — they form a tree. A thick lvol allocates its own clusters. A thin lvol allocates on first write. A snapshot shares its parent's clusters until a write forces a copy-on-write. A clone is a child snapshot that the host writes to. This page is about that tree: how it's built, how the parent-child relationship is recorded, and what diskengine does with it.

~15 min read · 2 diagrams · prerequisites: 5.1 lvstore + metadata · 4.4 hierarchy
On this page
  1. What "thin" actually means
  2. The lvol tree: lvstore, parent, snapshot, clone
  3. How bdev_lvol_snapshot works
  4. How bdev_lvol_clone works
  5. esnap: external snapshots, the cross-bdev case
  6. The naming convention: lvol names live in metadata
  7. How a thin lvol looks to the OS
  8. How diskengine uses snapshots and clones
  9. Edge cases & what trips people up

What "thin" actually means

"Thin" is a single bit on the lvol that says don't allocate the cluster until the first write to it. A thick (non-thin) lvol of size N clusters immediately takes N clusters from the pool. A thin lvol of size N clusters takes zero clusters at creation and grows its allocation as data is written.

The bit is the SPDK_BLOB_THIN_PROV flag in the blob's metadata descriptor ( lib/blob/blobstore.h:339 ). It is set once, at blob creation, and never changed. The runtime check is the public spdk_blob_is_thin_provisioned function ( lib/blob/blobstore.c:9596 ):
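As a rough model of that check and of the thick/thin difference at creation time, the Python sketch below treats the flag as one bit in a flags word and the extent table as a list with one slot per cluster. The `Blob` class, the `THIN_PROV` constant and its bit position, and the `free_clusters` list are all assumptions made for illustration; none of this is SPDK code.

```python
THIN_PROV = 1 << 0  # bit position assumed for this sketch


class Blob:
    """Toy blob: a flags word plus an extent table (None = not allocated)."""

    def __init__(self, num_clusters, flags, free_clusters):
        self.flags = flags
        if is_thin_provisioned(self):
            # Thin: every extent-table slot starts out "not allocated".
            self.extents = [None] * num_clusters
        else:
            # Thick: take all N clusters from the pool up front.
            self.extents = [free_clusters.pop(0) for _ in range(num_clusters)]


def is_thin_provisioned(blob):
    """Model of the runtime check: test the thin-provisioning bit."""
    return bool(blob.flags & THIN_PROV)


free_clusters = list(range(1024))
thick = Blob(256, flags=0, free_clusters=free_clusters)
thin = Blob(256, flags=THIN_PROV, free_clusters=free_clusters)
print(len(free_clusters))  # 768: only the thick blob consumed clusters
```

Both blobs report the same size (256 clusters); only the thick one has reduced the pool.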

The two paths diverge at the first write. A thick lvol has all its clusters pre-allocated in the extent table (an RLE-encoded list of cluster indices in the metadata). A thin lvol has an empty extent table — every entry is "not allocated." A write to a thin lvol triggers a cluster allocation, which is the expensive part. The flow is in spdk_blob_io_writev_ext (called from lvol_write at module/bdev/lvol/vbdev_lvol.c:941 ):

  1. Write arrives: bdev_io of type WRITE, offset + length known
  2. Compute cluster range: start_page / pages_per_cluster gives the cluster index
  3. Check extent table: is the target cluster already allocated?
  4. If thin & not allocated: request a free cluster from the blobstore
  5. Update extent table: metadata write, a small md_page
  6. Write the data: DMA the user's buffer to the new cluster
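The steps above can be sketched as a small Python model. It works in byte offsets rather than pages, uses the 4 MiB cluster size from fig. 1, and represents the extent table as a plain list; `cluster_range`, `thin_write` and the `free_clusters` list are hypothetical names for this illustration. Pool exhaustion and the data write itself are not modelled.

```python
CLUSTER_SIZE = 4 * 1024 * 1024  # 4 MiB, as in fig. 1


def cluster_range(offset, length):
    """Indices of every cluster touched by `length` bytes at `offset` (length > 0)."""
    first = offset // CLUSTER_SIZE
    last = (offset + length - 1) // CLUSTER_SIZE
    return range(first, last + 1)


def thin_write(extents, free_clusters, offset, length):
    """Allocate any missing clusters for a write; return the indices allocated."""
    allocated = []
    for index in cluster_range(offset, length):
        if extents[index] is None:                 # check the extent table
            extents[index] = free_clusters.pop(0)  # take a free cluster, record it
            allocated.append(index)
    return allocated


extents = [None] * 4            # thin lvol of 4 clusters, nothing allocated
free_clusters = [10, 11, 12]    # physical cluster numbers still free in the pool

first = thin_write(extents, free_clusters, 0, 4096)
again = thin_write(extents, free_clusters, 4096, 4096)
spanning = thin_write(extents, free_clusters, CLUSTER_SIZE - 512, 1024)
print(first, again, spanning)   # [0] [] [1]
```

Only the first write to each cluster pays for an allocation; the second write to cluster 0 allocates nothing, and the write that straddles the boundary allocates only cluster 1.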

The lvol tree: lvstore, parent, snapshot, clone

Every lvol has a back-pointer to its lvstore. Some lvols additionally have a back-pointer to a parent. The parent is a snapshot or an external snapshot. The full in-memory struct is in include/spdk_internal/lvolstore.h:109 ; the parent relationship is recorded in the blobstore metadata, not in the lvol struct.

The blobstore has a different vocabulary for the same idea:

| lvol term | blobstore equivalent | What it means |
| --- | --- | --- |
| regular lvol (no parent) | plain blob, no parent | Owns its own clusters. Either thick (pre-allocated) or thin (allocated on write). |
| snapshot | blob with SPDK_BLOB_EXTERNAL_SNAPSHOT unset, but is_snapshot bit set | An immutable view of a parent. Shares the parent's clusters. |
| clone | blob with a parent snapshot, not read-only | Child of a snapshot. Shares clusters with parent, can write, COW on first write to a shared cluster. |
| esnap clone | blob with BLOB_EXTERNAL_SNAPSHOT_ID xattr | A clone whose parent is an external bdev, not another lvol in this lvstore. |
flowchart TB
  LVS["lvstore 'lvs0'<br/>ClusterSize=4MiB<br/>TotalDataClusters=1024<br/>FreeClusters=768"]
  BASE["lvol 'base' (thick)<br/>256 clusters<br/>name=base, uuid=...a1"]
  SNAP["snapshot 'snap1'<br/>0 clusters (shares base's)<br/>read-only"]
  CLONE["clone 'cl1'<br/>0 clusters initially<br/>writable, parent=snap1"]
  ESNAP["esnap clone 'es1'<br/>0 clusters<br/>parent=external bdev Malloc0"]
  LVS --> BASE
  BASE -.->|bdev_lvol_snapshot| SNAP
  SNAP -.->|bdev_lvol_clone| CLONE
  LVS -.->|bdev_lvol_clone_bdev| ESNAP
  classDef lvs fill:#f5e6c8,stroke:#a17f1a;
  classDef lvol fill:#cfe1ff,stroke:#1c4f8a;
  classDef snap fill:#d6f5d6,stroke:#2a6f2a;
  classDef clone fill:#f5d6e0,stroke:#8a1c4f;
  class LVS lvs
  class BASE lvol
  class SNAP snap
  class CLONE clone
  class ESNAP clone

fig. 1   A small lvstore. base owns 256 clusters. snap1 is a snapshot of base: it shares the 256 clusters and is read-only. cl1 is a clone of snap1; it also shares those clusters but is writable. es1 is an esnap clone whose parent is an external bdev. Only base consumes data clusters, so FreeClusters = 1024 - 256 = 768.

How bdev_lvol_snapshot works

The RPC is registered at module/bdev/lvol/vbdev_lvol_rpc.c:461 . The handler is straightforward:

The actual work happens in vbdev_lvol_create_snapshot ( module/bdev/lvol/vbdev_lvol.c:1256 ), which calls spdk_lvol_create_snapshot ( lib/lvol/lvol.c:1292 ). That function allocates a new spdk_lvol, attaches a snapshot-specific xattr list, and calls spdk_bs_create_snapshot. The blobstore snapshot is the cheap part: it marks the existing blob as "snapshotting" (a sentinel xattr "SNAPTMP"), reads its extent table, and creates a new blob with the same extent table. The new blob's metadata says "I share clusters with blob X." The new blob is marked read-only.

How bdev_lvol_clone works

A clone is the writable child of a snapshot. The RPC is registered at module/bdev/lvol/vbdev_lvol_rpc.c:539 . The flow is almost identical to the snapshot RPC, except that the source is a snapshot (which means the resulting clone is writable) and the work happens in spdk_lvol_create_clone ( lib/lvol/lvol.c:1353 ).

The clone's initial extent table is empty. Reading a cluster in the clone that has never been written reads from the parent snapshot. Writing to such a cluster triggers copy-on-write (covered in detail in the next page). The COW happens lazily: the first write to a cluster in the clone is slow (one extra cluster allocation, one copy of the parent's cluster, and one extra metadata write), and every subsequent write to that cluster costs the same as a normal write.
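A toy Python model can make the read-through and lazy copy-on-write behaviour concrete. The `Snapshot` and `Clone` classes, the 8-byte "clusters" and the `cow_count` counter are assumptions for illustration only; real clusters are megabytes in size and the copy is done by the blobstore.

```python
class Snapshot:
    """Read-only parent: a fixed tuple of cluster contents."""

    def __init__(self, clusters):
        self._clusters = tuple(clusters)

    def read_cluster(self, index):
        return self._clusters[index]


class Clone:
    """Writable child: owns a cluster only after the first write to it."""

    def __init__(self, parent):
        self.parent = parent
        self.own = {}        # cluster index -> bytearray owned by the clone
        self.cow_count = 0   # how many copy-on-write events have happened

    def read_cluster(self, index):
        if index in self.own:
            return bytes(self.own[index])
        return self.parent.read_cluster(index)  # never written: read the parent

    def write(self, index, offset, data):
        if index not in self.own:
            # First write to a shared cluster: copy the parent's data, then modify.
            self.own[index] = bytearray(self.parent.read_cluster(index))
            self.cow_count += 1
        self.own[index][offset:offset + len(data)] = data


snap = Snapshot([b"AAAAAAAA", b"BBBBBBBB"])
clone = Clone(snap)
clone.write(0, 2, b"zz")   # triggers copy-on-write for cluster 0
clone.write(0, 0, b"y")    # same cluster again: no second copy
print(clone.read_cluster(0), clone.read_cluster(1), clone.cow_count)
```

After the two writes the clone owns cluster 0 only; cluster 1 is still served from the snapshot, and the snapshot's own data is unchanged.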

esnap: external snapshots, the cross-bdev case

"esnap" is short for external snapshot. It is the escape hatch when the parent of a clone is not an lvol in the same lvstore, but some other bdev entirely. The use case: you have a VM disk image on a bdev that lives somewhere else, and you want to clone it into the lvstore for fast write access.

The esnap clone is created via bdev_lvol_clone_bdev, registered at module/bdev/lvol/vbdev_lvol_rpc.c:622 . The handler takes a bdev name (not an lvol name) and creates an esnap clone from it. The C-side glue is vbdev_lvol_create_bdev_clone ( module/bdev/lvol/vbdev_lvol.c:1297 ), which calls spdk_lvol_create_esnap_clone ( lib/lvol/lvol.c:1227 ).

The way the esnap is recorded in the lvstore's metadata is unusual: the parent bdev's UUID is stored as a blob xattr named "EXTSNAP" (defined at lib/blob/blobstore.h:244 ). When the lvol is loaded later (e.g. after a restart), the lvol module looks up that bdev by UUID and uses vbdev_lvol_esnap_dev_create ( module/bdev/lvol/vbdev_lvol.c:1896 ) to attach the external bdev as the clone's backing device.
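The record-then-resolve pattern described above can be sketched in Python. The "EXTSNAP" xattr name comes from the text; the functions `create_esnap_clone` and `load_lvol`, the dictionary of bdevs keyed by UUID, and the example UUID are hypothetical. Treating a missing parent as "degraded" at load time is an assumption of this sketch.

```python
EXTSNAP_XATTR = "EXTSNAP"  # xattr name taken from the text above


def create_esnap_clone(parent_bdev_uuid):
    """Return the xattrs of a new esnap clone: only the parent's UUID is stored."""
    return {EXTSNAP_XATTR: parent_bdev_uuid}


def load_lvol(xattrs, bdevs_by_uuid):
    """Resolve the backing bdev at load time.

    Returns (backing_bdev_name, degraded).
    """
    parent_uuid = xattrs.get(EXTSNAP_XATTR)
    if parent_uuid is None:
        return None, False               # not an esnap clone at all
    backing = bdevs_by_uuid.get(parent_uuid)
    return backing, backing is None      # parent not found: degraded


uuid = "11111111-2222-3333-4444-555555555555"  # made-up example UUID
xattrs = create_esnap_clone(uuid)
present = load_lvol(xattrs, {uuid: "Malloc0"})
missing = load_lvol(xattrs, {})
print(present)  # ('Malloc0', False)
print(missing)  # (None, True)
```

Storing a UUID rather than a bdev name means the clone can find its parent again after a restart even if the parent has been registered under a different name.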

Diskengine's clone path is the "esnap bdev" variant; see snapshot create path:1 . The esnap bdev is an NBD export of a snapshot from another storage node, and the esnap clone is the local copy that becomes a new lvol. The data copy is done with dd over NBD; see DDCloneSnapshotToRaid:49 .

The naming convention: lvol names live in metadata

The lvol's name is a blob xattr, not a field on the spdk_lvol struct. The xattr is named "name" and the value is the human-readable name. Diskengine's "uuid" xattr is the canonical stable identifier. Both are set at lvol creation ( lib/lvol/lvol.c:1181 ) and read back at load ( lib/lvol/lvol.c:269 ).

The naming convention that diskengine enforces is that lvol names are the integer lvol ID rendered as a decimal string. From lvol name = lvol.LvolID:89 :

lvolName := fmt.Sprintf("%d", lvol.LvolID)

So an lvol with ID 42 is named "42" in the lvstore. The bdev alias the framework adds is "<lvstore_name>/42" — see module/bdev/lvol/vbdev_lvol.c:1196 , which is the spdk_sprintf_alloc("%s/%s", ...) call. The bdev's actual name (used in spdk_bdev_get_by_name and in JSON-RPC paths) is the lvol UUID, not the alias.
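The three identifiers (name, alias, bdev name) are easy to confuse, so here is a minimal Python sketch of how they relate. The helper names, the `register` function, the in-memory `bdevs` dictionary and the example UUID are hypothetical; the string formats mirror the "%d" and "%s/%s" patterns quoted above.

```python
def lvol_name(lvol_id):
    """Decimal rendering of the integer lvol ID, like fmt.Sprintf("%d", id)."""
    return "%d" % lvol_id


def bdev_alias(lvstore_name, lvol_id):
    """Alias of the form <lvstore_name>/<lvol name>."""
    return "%s/%s" % (lvstore_name, lvol_name(lvol_id))


bdevs = {}  # toy lookup table: both the real name and the alias resolve to the bdev


def register(lvstore_name, lvol_id, lvol_uuid):
    bdev = {"name": lvol_uuid, "aliases": [bdev_alias(lvstore_name, lvol_id)]}
    bdevs[bdev["name"]] = bdev
    for alias in bdev["aliases"]:
        bdevs[alias] = bdev
    return bdev


bdev = register("lvs0", 42, "aaaaaaaa-0000-0000-0000-0000000000a1")
print(bdev["name"], bdev["aliases"])
```

Looking up either the UUID or "lvs0/42" in this table returns the same entry, but the entry's own name is the UUID.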

How a thin lvol looks to the OS

To the kernel, a thin lvol is just a bdev. The blocklen is the lvstore's io_unit_size (typically the underlying bdev's block size, such as 512 or 4096 bytes, and much smaller than cluster_size), the blockcnt is num_clusters * io_units_per_cluster, and the product_name is "Logical Volume" (set at module/bdev/lvol/vbdev_lvol.c:1164 ). The kernel has no idea it's thin; the bdev is reported as having the full size.

What changes is the actual backing storage used. A 100 GiB thin lvol that has only ever been written to its first 4 MiB occupies 1 cluster of the lvstore. The rest of the lvstore's clusters are free. The kernel sees a 100 GiB bdev; the underlying lvstore sees 1 allocated cluster.
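The arithmetic behind that example, assuming the 4 MiB cluster size shown in fig. 1 and a write that starts at offset 0, can be checked directly (the variable names are illustrative):

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

cluster_size = 4 * MIB      # ClusterSize from fig. 1
lvol_size = 100 * GIB       # size reported to the consumer of the bdev
written_bytes = 4 * MIB     # only the first 4 MiB have ever been written

total_clusters = lvol_size // cluster_size
# Ceiling division: a partially written cluster still occupies a whole cluster.
allocated_clusters = -(-written_bytes // cluster_size)
unallocated_clusters = total_clusters - allocated_clusters

print(total_clusters, allocated_clusters, unallocated_clusters)  # 25600 1 25599
```

So the lvol advertises 25600 clusters' worth of space while holding exactly one.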

Diskengine's reconciliation path notices this gap and propagates it. The verifystate.go:184 code syncs the lvstore's free-cluster count to diskengine's DB. The fact that individual lvol allocated clusters are not tracked in diskengine's DB is intentional: that information comes back from SPDK only on demand.

How diskengine uses snapshots and clones

Diskengine creates snapshots for two reasons: VM snapshots and RAID rebuilds. The create path is in snapshotcreate.go:1 (out of scope for this page, but mentioned so you can find it). The delete path is in processDeletingSnapshots:40 .

The delete path is where the lvol tree is actually torn down. The relevant loop body, simplified, is:

Edge cases & what trips people up

Snapshot of a thin lvol: cheap, but the parent's data may shift

Snapshotting a thin lvol is fast because the snapshot just records the parent's current extent table. But if the parent was thin and had unallocated clusters, the snapshot inherits the parent's "unallocated" extent. If the parent later allocates a cluster that the snapshot shares, the snapshot sees the same cluster with the new data — snapshots are point-in-time views, but the share-everything optimization means writes to the parent do become visible to the snapshot until COW happens. This is a deliberate design tradeoff; the user-facing rule is: don't expect a snapshot of a thin lvol to be a stable view of the parent's historical state. Snapshot thick lvols, or inflate the lvol first.

Clone tree depth: blobstore does not enforce a limit, but the metadata gets bigger

There's no hard limit on snapshot/clone tree depth in SPDK. The metadata gets bigger as the tree grows, because each new snapshot/clone needs its own metadata page, but the cost is proportional to the number of lvol structs, not the depth. The practical limit is the lvstore's metadata page count, which is set at lvstore create time (num_md_pages_per_cluster_ratio). When you run out, you see a blobstore "no free md page" error.

Circular clones: the assertion at the end of destroy

A circular clone (lvol A is a clone of lvol B, which is a clone of lvol A) is not possible to construct through the RPC layer: you can only clone a snapshot, and a snapshot is read-only. But if you constructed one programmatically, the lvol destroy path has a backstop: module/bdev/lvol/vbdev_lvol.c:404 asserts false with the message "Lvols left in lvs, but unable to delete."

Lvstore is full

The most common runtime failure for a thin lvol is ENOSPC at the moment a write forces a cluster allocation. The lvstore has zero free clusters; the allocation fails; the write completes with an error. The lvol is not corrupted — its existing data is intact — but further writes are rejected. The fix is to grow the lvstore (via bdev_lvol_grow_lvstore, which requires the underlying bdev to have spare space) or to delete something.
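A minimal Python sketch of that failure mode, under the assumption that the allocation is the only step that can fail: the write returns a negative errno, the extent table is left untouched, and the same write succeeds once a cluster has been returned to the pool. `write_cluster` and the list-based pool are hypothetical.

```python
import errno


def write_cluster(extents, free_clusters, index):
    """Return 0 on success or -ENOSPC if a needed allocation cannot be satisfied."""
    if extents[index] is not None:
        return 0                       # already allocated: nothing to take from the pool
    if not free_clusters:
        return -errno.ENOSPC           # pool is empty: fail without touching the table
    extents[index] = free_clusters.pop(0)
    return 0


extents = [7, None]      # cluster 0 is backed by physical cluster 7, cluster 1 is not
free_clusters = []       # the lvstore is full

failed = write_cluster(extents, free_clusters, 1)
after_failure = list(extents)

free_clusters.append(3)  # something else was deleted, freeing one cluster
retried = write_cluster(extents, free_clusters, 1)
print(failed == -errno.ENOSPC, after_failure, retried, extents)
```

The failed write leaves `extents` exactly as it was, which is the sense in which the lvol is not corrupted.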

Snapshot has clones: cannot delete

A snapshot that has more than one clone cannot be deleted. The check is in module/bdev/lvol/vbdev_lvol.c:663 : spdk_blob_get_clones is called, and if the clone count is > 1, the delete is refused with -EPERM. The order matters: delete clones first, then the snapshot.
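"Clones first, then the snapshot" is a post-order walk of the lvol tree. The Python sketch below computes such an order from a child-to-parent map; the `deletion_order` function and the map format are assumptions for illustration and are not how diskengine or SPDK store the tree.

```python
def deletion_order(parent_of):
    """Given {lvol: parent or None}, return lvols ordered children before parents."""
    children = {}
    for name, parent in parent_of.items():
        children.setdefault(parent, []).append(name)

    order = []

    def visit(name):
        for child in sorted(children.get(name, [])):
            visit(child)       # delete everything below this node first
        order.append(name)     # then the node itself

    for root in sorted(children.get(None, [])):
        visit(root)
    return order


tree = {"snap1": None, "cl1": "snap1", "cl2": "snap1", "snap2": "cl1", "cl3": "snap2"}
print(deletion_order(tree))  # ['cl3', 'snap2', 'cl1', 'cl2', 'snap1']
```

Deleting in this order means no snapshot is ever asked to go away while something still depends on it.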

esnap bdev disappears

If the parent bdev of an esnap clone is removed (e.g. the underlying NVMe controller is detached), the lvol becomes degraded. Reads return -EIO, writes return -EIO, but the lvol itself is still in the lvstore and can be enumerated. When the bdev comes back, the lvol is automatically rehydrated by the hotplug path (vbdev_lvs_examine_config at line 1627). A degraded lvol's memory domain count is also reduced (see module/bdev/lvol/vbdev_lvol.c:1013 ), so the framework knows not to issue DMA operations to it.

What to take away

The lvol layer is a tree on top of a cluster pool. Thin lvols don't allocate until first write. Snapshots share their parent's clusters. Clones are writable children of snapshots; their first write to a shared cluster triggers copy-on-write. esnap clones are the cross-bdev variant — their parent is a UUID-identified bdev, possibly on another node. The tree's metadata is the blobstore's metadata (extent tables, snapshot/clone xattrs); the lvstore just adds the human-readable name and the bdev registration on top. The next page, 5.3 — Copy-on-write internals, is the inside of that first write to a shared cluster.