The tree that lives on top of the cluster pool.
The lvstore is a cluster pool. It has clusters; some are allocated, some are free. The lvols are the consumers of the cluster pool, but they're more than that: they form a tree. A thick lvol allocates its own clusters up front. A thin lvol allocates on first write. A snapshot shares its parent's clusters until a write forces a copy-on-write. A clone is a writable child of a snapshot. This page is about that tree: how it's built, how the parent-child relationship is recorded, and what diskengine does with it.
- What "thin" actually means
- The lvol tree: lvstore, parent, snapshot, clone
- How bdev_lvol_snapshot works
- How bdev_lvol_clone works
- esnap: external snapshots, the cross-bdev case
- The naming convention: lvol names live in metadata
- How a thin lvol looks to the OS
- How diskengine uses snapshots and clones
- Edge cases & what trips people up
What "thin" actually means
"Thin" is a single bit on the lvol that says don't allocate the cluster until the first write to it. A thick (non-thin) lvol of size N clusters immediately takes N clusters from the pool. A thin lvol of size N clusters takes zero clusters at creation and grows its allocation as data is written.
The bit is the SPDK_BLOB_THIN_PROV flag in the
blob's metadata descriptor
( lib/blob/blobstore.h:339 ). It is normally set
at blob creation, but it is not immutable: taking a snapshot
leaves the original blob thin, and inflating an lvol clears
the flag. The runtime check
is the public spdk_blob_is_thin_provisioned
function
( lib/blob/blobstore.c:9596 ).
The two paths diverge at the first write. A thick lvol has all
its clusters pre-allocated in the extent table (an RLE-encoded
list of cluster indices in the metadata). A thin lvol has an
empty extent table: every entry is "not allocated." A write
to an unallocated cluster of a thin lvol triggers a cluster
allocation, which is the expensive part. The flow is in
spdk_blob_io_writev_ext (called from
lvol_write at
module/bdev/lvol/vbdev_lvol.c:941 ).
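The SPDK source is not reproduced here. As a stand-in, the following is a toy Go model of the thick versus thin difference described above. All of the names (Pool, Lvol, NewLvol, Write, ErrNoSpace) are hypothetical and exist only for illustration; the real blobstore tracks extents and metadata pages, which this sketch ignores.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNoSpace stands in for the ENOSPC a real lvstore would report.
var ErrNoSpace = errors.New("no free clusters")

// Pool is a toy cluster pool; it only counts free clusters.
type Pool struct{ Free int }

// Lvol is a toy volume: one slot per logical cluster, true when backed.
type Lvol struct {
	pool      *Pool
	allocated []bool
}

// NewLvol creates a volume of n clusters. A thick volume claims all n
// clusters immediately; a thin volume claims none.
func NewLvol(p *Pool, n int, thin bool) (*Lvol, error) {
	l := &Lvol{pool: p, allocated: make([]bool, n)}
	if thin {
		return l, nil
	}
	if p.Free < n {
		return nil, ErrNoSpace
	}
	p.Free -= n
	for i := range l.allocated {
		l.allocated[i] = true
	}
	return l, nil
}

// Write touches one logical cluster, allocating it on first use.
func (l *Lvol) Write(cluster int) error {
	if l.allocated[cluster] {
		return nil // already backed: no allocation needed
	}
	if l.pool.Free == 0 {
		return ErrNoSpace // the write fails; existing clusters are untouched
	}
	l.pool.Free--
	l.allocated[cluster] = true
	return nil
}

func main() {
	p := &Pool{Free: 8}
	thick, _ := NewLvol(p, 4, false)
	thin, _ := NewLvol(p, 100, true)
	fmt.Println("free after create:", p.Free)
	_ = thick.Write(0) // no allocation: thick is fully backed
	_ = thin.Write(7)  // first write to cluster 7 takes one cluster
	fmt.Println("free after writes:", p.Free)
}
```

In this model the thin volume can be declared far larger than the pool, which is the point of thin provisioning, and also why a write can fail later for lack of space.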
The lvol tree: lvstore, parent, snapshot, clone
Every lvol has a back-pointer to its lvstore. Some lvols additionally have a back-pointer to a parent. The parent is a snapshot or an external snapshot. The full in-memory struct is in include/spdk_internal/lvolstore.h:109 ; the parent relationship is recorded in the blobstore metadata, not in the lvol struct.
The blobstore has a different vocabulary for the same idea:
| lvol term | blobstore equivalent | What it means |
|---|---|---|
| regular lvol (no parent) | plain blob, no parent | Owns its own clusters. Either thick (pre-allocated) or thin (allocated on write). |
| snapshot | blob with SPDK_BLOB_EXTERNAL_SNAPSHOT unset, but is_snapshot bit set | An immutable view of a parent. Shares the parent's clusters. |
| clone | blob with a parent snapshot, not read-only | Child of a snapshot. Shares clusters with parent, can write, COW on first write to a shared cluster. |
| esnap clone | blob with BLOB_EXTERNAL_SNAPSHOT_ID xattr | A clone whose parent is an external bdev, not another lvol in this lvstore. |
flowchart TB LVS["lvstore 'lvs0'
ClusterSize=4MiB
TotalDataClusters=1024
FreeClusters=768"] BASE["lvol 'base' (thick)
256 clusters
name=base, uuid=...a1"] SNAP["snapshot 'snap1'
0 clusters (shares base's)
read-only"] CLONE["clone 'cl1'
0 clusters initially
writable, parent=snap1"] ESNAP["esnap clone 'es1'
0 clusters
parent=external bdev Malloc0"] LVS --> BASE BASE -.->|bdev_lvol_snapshot| SNAP SNAP -.->|bdev_lvol_clone| CLONE LVS -.->|bdev_lvol_clone_bdev| ESNAP classDef lvs fill:#f5e6c8,stroke:#a17f1a; classDef lvol fill:#cfe1ff,stroke:#1c4f8a; classDef snap fill:#d6f5d6,stroke:#2a6f2a; classDef clone fill:#f5d6e0,stroke:#8a1c4f; class LVS lvs class BASE lvol class SNAP snap class CLONE clone class ESNAP clone
fig. 1 A small lvstore. base owns 256
clusters. snap1 is a snapshot of
base: it shares the 256 clusters and is
read-only. cl1 is a clone of snap1;
it also shares those clusters but is writable. es1 is
an esnap clone whose parent is an external bdev. Only base
consumes data clusters, so
FreeClusters = 1024 - 256 = 768.
How bdev_lvol_snapshot works
The RPC is registered at module/bdev/lvol/vbdev_lvol_rpc.c:461 . The handler is straightforward:
The actual work happens in
vbdev_lvol_create_snapshot
( module/bdev/lvol/vbdev_lvol.c:1256 ),
which calls
spdk_lvol_create_snapshot
( lib/lvol/lvol.c:1292 ). That function
allocates a new spdk_lvol, attaches a
snapshot-specific xattr list, and calls
spdk_bs_create_snapshot. The blobstore snapshot
is the cheap part because no data is copied: it records an
in-progress sentinel xattr ("SNAPTMP"), creates a new blob,
and hands the existing blob's cluster map to it. The new blob
is marked read-only. The existing blob is left as a thin clone
whose metadata names the snapshot as its parent.
How bdev_lvol_clone works
A clone is the writable child of a snapshot. The RPC is
registered at
module/bdev/lvol/vbdev_lvol_rpc.c:539 .
The flow is almost identical to the snapshot RPC, except that
the source is a snapshot (which means the resulting clone is
writable) and the work happens in
spdk_lvol_create_clone
( lib/lvol/lvol.c:1353 ).
The clone's initial extent table is empty. Reading from a cluster in the clone that has never been written reads from the parent snapshot. Writing to such a cluster triggers copy-on-write (covered in detail in the next page). The COW happens lazily: the first write to a cluster in the clone is slow (one extra cluster allocation + one extra metadata write), every subsequent write to that cluster is the normal cost.
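The read fall-through and the lazy copy can be illustrated with a small Go model. This is a sketch under simplifying assumptions, not SPDK's implementation: clusters are 4-byte slices, there is no metadata write, and the Snapshot and Clone types are hypothetical.

```go
package main

import "fmt"

const clusterSize = 4 // bytes per cluster in this toy model

// Snapshot is read-only; a nil entry means the cluster was never allocated.
type Snapshot struct {
	clusters [][]byte
}

// Clone starts with no clusters of its own and falls back to its parent.
type Clone struct {
	parent   *Snapshot
	clusters [][]byte // nil entry = not allocated in the clone
	Allocs   int      // number of copy-on-write allocations performed
}

// NewClone returns a clone with an empty cluster map.
func NewClone(s *Snapshot) *Clone {
	return &Clone{parent: s, clusters: make([][]byte, len(s.clusters))}
}

// Read returns a copy of cluster i: the clone's own data if present,
// else the parent's, else zeros.
func (c *Clone) Read(i int) []byte {
	out := make([]byte, clusterSize)
	switch {
	case c.clusters[i] != nil:
		copy(out, c.clusters[i])
	case c.parent.clusters[i] != nil:
		copy(out, c.parent.clusters[i])
	}
	return out
}

// Write stores one byte. The first write to a cluster copies the
// parent's contents into a freshly allocated cluster (copy-on-write).
func (c *Clone) Write(i, off int, b byte) {
	if c.clusters[i] == nil {
		c.clusters[i] = c.Read(i) // slow path: allocate and copy
		c.Allocs++
	}
	c.clusters[i][off] = b // fast path for every later write
}

func main() {
	snap := &Snapshot{clusters: [][]byte{[]byte("abcd"), nil}}
	cl := NewClone(snap)
	cl.Write(0, 1, 'X') // triggers the copy
	cl.Write(0, 2, 'Y') // same cluster, no second copy
	fmt.Printf("clone=%s snapshot=%s allocs=%d\n", cl.Read(0), snap.clusters[0], cl.Allocs)
	// prints: clone=aXYd snapshot=abcd allocs=1
}
```

Two writes to the same cluster cost one allocation, and the snapshot's data is unchanged afterwards.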
esnap: external snapshots, the cross-bdev case
"esnap" is short for external snapshot. It is the escape hatch when the parent of a clone is not an lvol in the same lvstore, but some other bdev entirely. The use case: you have a VM disk image on a bdev that lives somewhere else, and you want to clone it into the lvstore for fast write access.
The esnap clone is created via
bdev_lvol_clone_bdev, registered at
module/bdev/lvol/vbdev_lvol_rpc.c:622 .
The handler takes a bdev name (not an lvol name) and creates
an esnap clone from it. The C-side glue is
vbdev_lvol_create_bdev_clone
( module/bdev/lvol/vbdev_lvol.c:1297 ),
which calls
spdk_lvol_create_esnap_clone
( lib/lvol/lvol.c:1227 ).
The way the esnap is recorded in the lvstore's metadata is
unusual: the parent bdev's UUID is stored as a blob xattr
named "EXTSNAP" (defined at
lib/blob/blobstore.h:244 ). When the
lvol is loaded later (e.g. after a restart), the lvol module
looks up that bdev by UUID and uses
vbdev_lvol_esnap_dev_create
( module/bdev/lvol/vbdev_lvol.c:1896 )
to attach the external bdev as the clone's backing device.
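The load-time lookup can be pictured with a short Go sketch. It assumes, for illustration only, that the xattr value is a UUID string and that the set of known bdevs is a simple map; the Blob type, the state names, and ResolveEsnap are hypothetical and are not SPDK functions.

```go
package main

import "fmt"

// extSnapXattr is the xattr name quoted in the text above.
const extSnapXattr = "EXTSNAP"

// Blob is a toy blob that only carries xattrs.
type Blob struct {
	Xattrs map[string]string
}

// EsnapState describes the outcome of resolving a blob's external parent.
type EsnapState int

const (
	NotEsnap      EsnapState = iota // blob has no external parent
	Attached                        // parent bdev was found by UUID
	ParentMissing                   // xattr present, but no such bdev right now
)

// ResolveEsnap looks up the external parent of a blob by the UUID
// stored in its EXTSNAP xattr.
func ResolveEsnap(b Blob, bdevNameByUUID map[string]string) (string, EsnapState) {
	uuid, ok := b.Xattrs[extSnapXattr]
	if !ok {
		return "", NotEsnap
	}
	if name, found := bdevNameByUUID[uuid]; found {
		return name, Attached
	}
	return "", ParentMissing
}

func main() {
	bdevs := map[string]string{"uuid-a1": "Malloc0"}
	es1 := Blob{Xattrs: map[string]string{extSnapXattr: "uuid-a1"}}

	name, state := ResolveEsnap(es1, bdevs)
	fmt.Println(name, state == Attached)

	delete(bdevs, "uuid-a1") // the external bdev goes away
	_, state = ResolveEsnap(es1, bdevs)
	fmt.Println(state == ParentMissing)
}
```

Keying on the UUID instead of the bdev name is what lets the relationship survive a restart in which the external bdev comes back under a different name.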
Diskengine's clone path is the "esnap bdev" variant; see
snapshot create path:1 .
The esnap bdev is an NBD export of a snapshot from another
storage node, and the esnap clone is the local copy that
becomes a new lvol. The data copy is done with
dd over NBD; see
DDCloneSnapshotToRaid:49 .
The naming convention: lvol names live in metadata
The lvol's name is persisted as a blob xattr; the name held on
the spdk_lvol struct is only the in-memory copy of it. The xattr
is named "name" and
the value is the human-readable name. Diskengine's
"uuid" xattr is the canonical stable identifier.
Both are set at lvol creation
( lib/lvol/lvol.c:1181 ) and read
back at load
( lib/lvol/lvol.c:269 ).
The naming convention that diskengine enforces is that lvol names are the integer lvol ID rendered as a decimal string. From lvol name = lvol.LvolID:89 :
lvolName := fmt.Sprintf("%d", lvol.LvolID)
So an lvol with ID 42 is named "42" in the
lvstore. The bdev alias the framework adds is
"<lvstore_name>/42"; see
module/bdev/lvol/vbdev_lvol.c:1196 ,
which is the spdk_sprintf_alloc("%s/%s", ...)
call. The bdev's actual name (used in
spdk_bdev_get_by_name and in JSON-RPC paths) is
the lvol UUID, not the alias.
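Put together, the two naming steps look like this in Go. The LvolName body mirrors the fmt.Sprintf call quoted above; the uint64 ID type and the Alias helper are assumptions made for this sketch, with Alias imitating the "%s/%s" format that SPDK applies on the C side.

```go
package main

import "fmt"

// LvolName renders the integer lvol ID as a decimal string, following
// the diskengine convention quoted above. The uint64 type is an assumption.
func LvolName(lvolID uint64) string {
	return fmt.Sprintf("%d", lvolID)
}

// Alias builds the "<lvstore_name>/<lvol_name>" alias. Hypothetical
// helper; in SPDK the alias is produced by the lvol bdev module itself.
func Alias(lvstoreName, lvolName string) string {
	return fmt.Sprintf("%s/%s", lvstoreName, lvolName)
}

func main() {
	name := LvolName(42)
	fmt.Println(name, Alias("lvs0", name))
}
```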
How a thin lvol looks to the OS
To the kernel, a thin lvol is just a bdev. The
blocklen is the lvstore's io_unit_size
(typically the block size of the underlying bdev, far smaller
than cluster_size), the
blockcnt is
num_clusters * io_units_per_cluster, the
product_name is "Logical Volume"
(set at module/bdev/lvol/vbdev_lvol.c:1164 ).
The kernel has no idea it's thin; the bdev is reported as
having the full size.
What changes is the actual backing storage used. A 100 GiB thin lvol that has only ever been written to its first 4 MiB occupies 1 cluster of the lvstore. The rest of the lvstore's clusters are free. The kernel sees a 100 GiB bdev; the underlying lvstore sees 1 allocated cluster.
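The arithmetic behind that example can be worked through in a few lines of Go. The 4 MiB cluster size comes from fig. 1; the 4096-byte io unit is an assumption chosen for the example, and BlockCount is a hypothetical helper that restates the blockcnt formula above.

```go
package main

import "fmt"

const (
	MiB uint64 = 1 << 20
	GiB uint64 = 1 << 30
)

// BlockCount applies blockcnt = num_clusters * io_units_per_cluster.
func BlockCount(numClusters, clusterSize, ioUnitSize uint64) uint64 {
	ioUnitsPerCluster := clusterSize / ioUnitSize
	return numClusters * ioUnitsPerCluster
}

func main() {
	clusterSize := 4 * MiB
	ioUnitSize := uint64(4096) // assumption: 4 KiB io unit

	numClusters := 100 * GiB / clusterSize // logical size in clusters
	blockCnt := BlockCount(numClusters, clusterSize, ioUnitSize)

	reported := blockCnt * ioUnitSize // what the consumer of the bdev sees
	allocated := 1 * clusterSize      // only the first 4 MiB was ever written

	fmt.Printf("clusters=%d blockcnt=%d reported=%dGiB allocated=%dMiB\n",
		numClusters, blockCnt, reported/GiB, allocated/MiB)
	// prints: clusters=25600 blockcnt=26214400 reported=100GiB allocated=4MiB
}
```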
Diskengine's reconciliation path notices this gap and propagates it. The
code at verifystate.go:184 syncs the lvstore's free-cluster count to diskengine's DB. The fact that per-lvol allocated-cluster counts are not tracked in diskengine's DB is intentional: that information comes back from SPDK only on demand.
How diskengine uses snapshots and clones
Diskengine creates snapshots for two reasons: VM snapshots and RAID rebuilds. The create path is in
snapshotcreate.go:1 (out of scope for this page, but mentioned so you can find it). The delete path is in processDeletingSnapshots:40 .
The delete path is where the lvol tree is actually torn down. The relevant loop body, simplified, is:
Edge cases & what trips people up
Snapshot of a thin lvol: cheap, and still a stable point-in-time view
Snapshotting a thin lvol is fast because no data is copied: the snapshot takes over the lvol's current cluster map, and the lvol continues as a thin clone of that snapshot. If the lvol had unallocated clusters, the snapshot inherits them as unallocated, and reads of those ranges through the snapshot return zeros. When the lvol later writes to such a range, or to a range the snapshot does hold, the new cluster is allocated to the lvol, not to the snapshot, so the write is not visible through the snapshot. The user-facing rule is that the snapshot stays a stable view of the lvol as it was, but the lvol is thin from that point on, and every first write to a cluster pays the allocation or copy-on-write cost. If that cost matters, inflate the lvol afterwards.
Clone tree depth: blobstore does not enforce a limit, but the metadata gets bigger
There's no hard limit on snapshot/clone tree depth in SPDK.
The metadata gets bigger as the tree grows, because each new
snapshot/clone needs its own metadata page, but the cost is
proportional to the number of lvol structs, not the depth.
The practical limit is the lvstore's metadata page count,
which is set at lvstore create time
(num_md_pages_per_cluster_ratio). When you run
out, you see a blobstore "no free md page" error.
Circular clones: the assertion at the end of destroy
A circular clone (lvol A is a clone of lvol B, which is a clone of lvol A) cannot be constructed through the RPC layer: you can only clone a snapshot, and a snapshot is read-only. But if you constructed one programmatically, the lvol destroy path has a backstop:
module/bdev/lvol/vbdev_lvol.c:404 asserts false with the message
"Lvols left in lvs, but unable to delete."
Lvstore is full
The most common runtime failure for a thin lvol is
ENOSPC at the moment a write forces a cluster
allocation. The lvstore has zero free clusters; the
allocation fails; the write completes with an error. The
lvol is not corrupted — its existing data is intact — but
further writes are rejected. The fix is to grow the lvstore
(via bdev_lvol_grow_lvstore, which requires the
underlying bdev to have spare space) or to delete something.
Snapshot has more than one clone: cannot delete
A snapshot that has more than one clone cannot be deleted.
The check is in
module/bdev/lvol/vbdev_lvol.c:663 :
spdk_blob_get_clones is called, and if the clone
count is > 1, the delete is refused with
-EPERM. The order matters: delete clones first
(at most one may remain), then the snapshot.
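A caller that tears down a whole subtree can sidestep the refusal by deleting children before parents. The Go sketch below models the "clone count > 1" rule quoted above and a children-first ordering. Node, CheckDelete, DeleteOrder, and ErrPerm are hypothetical names for this illustration and are not taken from diskengine or SPDK.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrPerm stands in for the -EPERM returned by the real check.
var ErrPerm = errors.New("snapshot has more than one clone")

// Node is one lvol in the tree, with the clones that hang off it.
type Node struct {
	Name   string
	Clones []*Node
}

// CheckDelete models the rule from the text: refuse when clone count > 1.
func CheckDelete(n *Node) error {
	if len(n.Clones) > 1 {
		return ErrPerm
	}
	return nil
}

// DeleteOrder returns the names in children-first (post-order) order,
// so every snapshot appears after all of its clones.
func DeleteOrder(n *Node) []string {
	var out []string
	for _, c := range n.Clones {
		out = append(out, DeleteOrder(c)...)
	}
	return append(out, n.Name)
}

func main() {
	snap := &Node{Name: "snap1", Clones: []*Node{{Name: "cl1"}, {Name: "cl2"}}}
	fmt.Println(CheckDelete(snap)) // refused while two clones exist
	fmt.Println(DeleteOrder(snap)) // [cl1 cl2 snap1]
}
```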
esnap bdev disappears
If the parent bdev of an esnap clone is removed (e.g. the
underlying NVMe controller is detached), the lvol becomes
degraded. Reads return -EIO, writes return -EIO, but
the lvol itself is still in the lvstore and can be
enumerated. When the bdev comes back, the lvol is
automatically rehydrated by the hotplug path
(vbdev_lvs_examine_config at line 1627). A
degraded lvol's memory domain count is also reduced
(see
module/bdev/lvol/vbdev_lvol.c:1013 ),
so the framework knows not to issue DMA operations to it.
What to take away
The lvol layer is a tree on top of a cluster pool. Thin lvols don't allocate until first write. Snapshots share their parent's clusters. Clones are writable children of snapshots; their first write to a shared cluster triggers copy-on-write. esnap clones are the cross-bdev variant — their parent is a UUID-identified bdev, possibly on another node. The tree's metadata is the blobstore's metadata (extent tables, snapshot/clone xattrs); the lvstore just adds the human-readable name and the bdev registration on top. The next page, 5.3 — Copy-on-write internals, is the inside of that first write to a shared cluster.