What's actually on the disk.
An lvol bdev is a virtual bdev on top of another bdev. The backing bdev (the NVMe or Malloc disk that actually holds the bytes) is divided into fixed-size clusters, and a small region at its start is dedicated to a metadata area that the lvol subsystem owns outright. This page is about that metadata: how an lvstore is described in raw bytes, how the on-disk super block works, how a GPT partition fits in, and what happens on startup when SPDK loads the lvstore.
- What an lvstore is, in one paragraph
- Cluster size, lvol size, total clusters, free clusters
- The on-disk metadata format: spdk_bs_super_block
- The GPT partition that backs the lvstore
- How bdev_lvol_create chooses the lvstore
- How the lvstore is loaded on SPDK startup
- What diskengine sees: FreeClusters * ClusterSize
- Edge cases & what trips people up
What an lvstore is, in one paragraph
An lvstore is a self-contained allocator for a single backing bdev. It owns a fixed cluster size, carves the bdev into N clusters, and maintains a metadata area at the start that records which clusters are in use, which are free, and which belong to which lvol. Every lvol — a thin or thick "logical volume" bdev — is a child of exactly one lvstore. The lvstore's on-disk image is a single, self-describing blobstore (SPDK uses the word blobstore internally; lvstore is the bdev-level name for the same object).
The in-memory struct is in include/spdk_internal/lvolstore.h:88. The on-disk mirror is a blobstore, and the blobstore's public-facing struct is in lib/blob/blobstore.h:156. We will look at both, but in different orders.
Cluster size, lvol size, total clusters, free clusters
There are exactly four numbers that describe the geometry of an lvstore. If you remember nothing else from this page, remember these four:
| Name | What it counts | Where it lives | Units |
|---|---|---|---|
| ClusterSize | Size of one cluster (the allocation unit) | lvstore metadata (immutable after init) | bytes |
| TotalDataClusters | Clusters in the lvstore that hold user data | spdk_blob_store.total_data_clusters | clusters |
| FreeClusters | Clusters not allocated to any lvol | spdk_blob_store.num_free_clusters | clusters |
| TotalClusters | Total clusters (data + metadata) | spdk_blob_store.total_clusters | clusters |
The relationship that diskengine cares about is:
lvstore_capacity_bytes = TotalDataClusters * ClusterSize
lvstore_free_bytes = FreeClusters * ClusterSize
Both formulas appear in lvstore capacity sync:184, where diskengine re-syncs its DB view of the lvstore against SPDK's view every cycle. The two counts come back across RPC as total_data_clusters and free_clusters in the bdev_lvol_get_lvstores response, along with cluster_size. Diskengine multiplies them client-side. The two cluster counts look alike on the wire but mean different things on disk: total_data_clusters excludes the clusters taken by the metadata area.
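The two formulas above can be sketched in a few lines. This is a minimal illustration, not diskengine's code: the helper name lvstore_bytes and the sample values are made up, and the dict is assumed to be one lvstore entry from the RPC response carrying the three fields named above.

```python
def lvstore_bytes(lvs: dict) -> dict:
    """Derive byte figures from one lvstore entry of the RPC response.

    Hypothetical helper: only the three field names come from the text.
    """
    cluster_size = lvs["cluster_size"]
    total = lvs["total_data_clusters"]
    free = lvs["free_clusters"]
    if free > total:
        # A response like this would be inconsistent; refuse to derive numbers.
        raise ValueError("free_clusters exceeds total_data_clusters")
    return {
        "capacity_bytes": total * cluster_size,
        "free_bytes": free * cluster_size,
        "used_bytes": (total - free) * cluster_size,
    }


# Made-up sample: 25600 data clusters of 4 MiB each, a quarter of them free.
sample = {
    "cluster_size": 4 * 1024 * 1024,
    "total_data_clusters": 25600,
    "free_clusters": 6400,
}
result = lvstore_bytes(sample)
```

With these sample values the capacity works out to 100 GiB and the free space to 25 GiB. All arithmetic stays in integers, so no rounding is involved.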
A typical default is 4 MiB clusters, set by SPDK_LVS_OPTS_CLUSTER_SZ at include/spdk_internal/lvolstore.h:17. You can override it at lvstore create time. The cluster size must be a multiple of the backing bdev's block size (typically 512 or 4096 bytes). The lvol size is rounded up to a whole number of clusters: an lvol that is 1 byte too long for its cluster count just consumes the next cluster.
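The rounding rule and the block-size constraint can be written out as a short sketch. The function names are hypothetical; only the 4 MiB default and the two rules come from the text above.

```python
DEFAULT_CLUSTER_SIZE = 4 * 1024 * 1024  # 4 MiB, the default described above


def check_cluster_size(cluster_size: int, block_size: int) -> None:
    """Reject a cluster size that is not a whole number of backing blocks."""
    if cluster_size <= 0 or block_size <= 0:
        raise ValueError("sizes must be positive")
    if cluster_size % block_size != 0:
        raise ValueError("cluster size must be a multiple of the block size")


def clusters_for_lvol(size_bytes: int, cluster_size: int = DEFAULT_CLUSTER_SIZE) -> int:
    """Round an lvol size up to a whole number of clusters."""
    if size_bytes <= 0:
        raise ValueError("lvol size must be positive")
    # Integer ceiling division: one extra byte costs one extra cluster.
    return (size_bytes + cluster_size - 1) // cluster_size


exact = clusters_for_lvol(DEFAULT_CLUSTER_SIZE)         # fits one cluster exactly
one_over = clusters_for_lvol(DEFAULT_CLUSTER_SIZE + 1)  # spills into a second
```

Here exact is 1 and one_over is 2, which is the "1 byte too long" case from the paragraph above.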
The on-disk metadata format: spdk_bs_super_block
The super block sits at the very start of the lvstore, in cluster 0 (the lowest LBA range on the bdev). It occupies a single 4 KiB page: even if the cluster size is larger, the super block is a single spdk_bs_super_block struct padded out to exactly 0x1000 bytes. The struct definition is in lib/blob/blobstore.h:407, and it's exhaustively annotated below.
When SPDK starts up and tries to load the lvstore, the first thing it does is read this struct, check the signature, check the version, and check the CRC. That logic is at lib/blob/blobstore.c:1822.
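The fail-closed shape of that validation can be sketched as follows. This is an illustration in Python, not the SPDK C code: the SuperBlock dataclass, the function name, the returned reason strings and the value of SUPPORTED_VERSION are all assumptions. The CRC is passed in already computed, so the sketch makes no claim about field offsets or about how the checksum is calculated. The order of the checks is illustrative as well.

```python
from dataclasses import dataclass
from typing import Optional

SIGNATURE = b"SPDKBLOB"   # 8-byte signature named on this page
LVOL_BSTYPE = b"LVOLSTORE"
SUPPORTED_VERSION = 3      # assumption: stands in for SPDK_BS_VERSION


@dataclass
class SuperBlock:
    """Hypothetical, already-parsed view of the fields the checks need."""
    signature: bytes
    version: int
    stored_crc: int
    bstype: bytes
    size: int  # blobstore size in bytes


def validate(sb: SuperBlock, computed_crc: int, bdev_size: int) -> Optional[str]:
    """Return None if the super block is acceptable, else the first failure."""
    if sb.signature != SIGNATURE:
        return "bad signature"
    if sb.version > SUPPORTED_VERSION:
        return "on-disk format is newer than this build understands"
    if sb.stored_crc != computed_crc:
        return "bad crc"
    # An all-zero bstype is tolerated; anything else must name the lvol type.
    if sb.bstype.rstrip(b"\x00") not in (b"", LVOL_BSTYPE):
        return "bstype mismatch"
    if sb.size > bdev_size:
        return "blobstore is larger than the backing bdev"
    return None


good = SuperBlock(SIGNATURE, 3, 0xABCD, LVOL_BSTYPE.ljust(16, b"\x00"), 1 << 30)
ok = validate(good, computed_crc=0xABCD, bdev_size=1 << 30)
```

Any single failed check stops the load; there is no partial acceptance.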
The GPT partition that backs the lvstore
On a real disk, the lvstore does not live at sector 0; it lives at the start of a GPT partition. The convention is that the partition type UUID identifies the partition as an SPDK blobstore/lvstore. On the diskengine side, the partition is created by the disk provisioning code; on the SPDK side, the lvol module's vbdev_lvs_examine_disk callback is invoked for every examined bdev, and only the ones that look like blobstores (have the SPDK signature and a matching bstype) are loaded.
Two things to know about the partition:
- The partition type UUID must match the convention. Otherwise tooling won't see the partition. The actual type UUID used by diskengine is set in the disk provisioning step; the convention is that the partition type UUID begins with the bytes that identify the partition as an SPDK blobstore. See disk provisioning:1 for the specific UUID used.
- The partition's LBA range is the lvstore's address space. The lvstore does not know about the rest of the disk. total_clusters in the super block is partition_bytes / cluster_size, and md_start is an offset from the partition's start, not the disk's.
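The partition-relative addressing in the second point can be sketched as below. The helper names are hypothetical, and the sketch assumes md_start is counted in 4 KiB pages. If it is expressed in another unit, only the PAGE_SIZE factor changes.

```python
PAGE_SIZE = 4096  # assumption: metadata offsets are counted in 4 KiB pages


def total_clusters(partition_bytes: int, cluster_size: int) -> int:
    """Whole clusters that fit in the partition; any remainder is unused."""
    return partition_bytes // cluster_size


def md_start_disk_offset(partition_start_lba: int, block_size: int,
                         md_start_pages: int) -> int:
    """Translate a partition-relative md_start into an absolute disk byte offset."""
    partition_start_bytes = partition_start_lba * block_size
    return partition_start_bytes + md_start_pages * PAGE_SIZE


# Made-up layout: partition starts at LBA 2048 on a 512-byte-block disk.
clusters = total_clusters(100 * 1024**3 + 1024**2, 4 * 1024**2)
offset = md_start_disk_offset(partition_start_lba=2048, block_size=512,
                              md_start_pages=3)
```

In this made-up layout the trailing 1 MiB of the partition does not form a whole cluster, so clusters is 25600. The metadata offset lands 3 pages past the 1 MiB partition start, at byte 1060864 on the disk.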
flowchart LR
subgraph Disk["Physical disk (NVMe namespace)"]
subgraph P1["GPT partition (lvolstore)"]
SB["Cluster 0: spdk_bs_super_block
signature, version, CRC
md_start, md_len, cluster_size, ..."]
MD["Clusters 1..N: metadata area
used_page_mask, used_cluster_mask
used_blobid_mask, blob MD pages"]
DATA["Clusters N+1..M: data clusters
each cluster holds lvol data
or is unallocated (free)"]
end
subgraph P2["Other partitions / free space"]
OTHER["unrelated data or other bdevs"]
end
end
SB --> MD --> DATA
classDef sb fill:#f5e6c8,stroke:#a17f1a;
classDef md fill:#cfe1ff,stroke:#1c4f8a;
classDef data fill:#d6f5d6,stroke:#2a6f2a;
class SB sb
class MD md
class DATA data
fig. 1 The lvstore is one GPT partition on the disk. Inside the partition, the first cluster holds the super block; the next few clusters are the metadata area; everything after that is data clusters, allocated to lvols or free.
How bdev_lvol_create chooses the lvstore
When diskengine calls bdev_lvol_create, the RPC
payload must specify which lvstore to create the lvol
in. There are two ways: by UUID or by name. They are mutually
exclusive — the helper at
enforces this:
Diskengine always passes the UUID (see LvstoreUUID:91). The UUID is the stable identifier; the name is a human-readable label that can change via bdev_lvol_rename_lvstore. For automation, always prefer the UUID.
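A client-side version of the "exactly one of UUID or name" rule might look like the sketch below. The builder function is hypothetical and is not diskengine's code. The parameter names (lvol_name, size_in_mib, thin_provision, uuid, lvs_name) are assumed to match the SPDK release in use; older releases may name the size parameter differently.

```python
from typing import Optional


def lvol_create_request(lvol_name: str, size_in_mib: int, *,
                        lvs_uuid: Optional[str] = None,
                        lvs_name: Optional[str] = None,
                        thin_provision: bool = False,
                        request_id: int = 1) -> dict:
    """Build a JSON-RPC 2.0 request that names the lvstore exactly one way."""
    # True when both are set or both are missing; either case is rejected.
    if (lvs_uuid is None) == (lvs_name is None):
        raise ValueError("specify exactly one of lvs_uuid or lvs_name")
    params = {
        "lvol_name": lvol_name,
        "size_in_mib": size_in_mib,
        "thin_provision": thin_provision,
    }
    if lvs_uuid is not None:
        params["uuid"] = lvs_uuid
    else:
        params["lvs_name"] = lvs_name
    return {"jsonrpc": "2.0", "id": request_id,
            "method": "bdev_lvol_create", "params": params}


# Made-up UUID, for illustration only.
req = lvol_create_request("vol0", 1024,
                          lvs_uuid="11111111-2222-3333-4444-555555555555")
```

Rejecting the ambiguous request before it is sent keeps the failure local and easy to diagnose. The server-side check still applies either way.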
The two lookup functions (module/bdev/lvol/vbdev_lvol.c:549 and module/bdev/lvol/vbdev_lvol.c:561) are linear scans over the in-memory list g_spdk_lvol_pairs (a TAILQ of struct lvol_store_bdev). For an SPDK instance with 2-3 lvstores, this is a few comparisons. The list needs no lock because it is reactor-local (all lvol operations run on one thread).
How the lvstore is loaded on SPDK startup
When SPDK starts and the bdev framework is initialized, the
lvol module's examine_disk callback runs for every
bdev that has not been claimed. The callback tries to detect an
lvstore on each bdev. The function is at
module/bdev/lvol/vbdev_lvol.c:1754 .
What diskengine sees: FreeClusters * ClusterSize
The three numbers that diskengine actually consumes from SPDK are cluster_size, total_data_clusters, and free_clusters. They come back across the bdev_lvol_get_lvstores RPC. Diskengine's view of "what's the lvstore's free space" is just the multiplication, done in spdkFree = FreeClusters * ClusterSize:184.
Why the multiplication happens client-side, not server-side: SPDK's RPC returns raw cluster counts because that is the canonical unit. Anything derived from those counts (free bytes, used bytes, percent full) is a client decision. This is also why the two multiplication sites in verifystate.go:184 and snapshotdelete.go:95 look the same: both come from the same RPC, just consumed by different parts of diskengine (reconciliation vs. post-delete cleanup).
Edge cases & what trips people up
Corrupt lvstore: signature mismatch
If the first 8 bytes of the partition are not
"SPDKBLOB", bs_super_validate returns
-EILSEQ. The examine callback bails out and the
lvstore is not loaded. The bdev is still there, the partition
is still there — they just don't look like an lvstore anymore.
The recovery options are: reformat the partition (which
destroys the lvstore) or restore from a backup.
Corrupt lvstore: bad CRC
The signature is right but the CRC isn't. This usually means a
torn write — the SPDK instance died in the middle of updating
the super block. The clean field would also be 0
in this case. The lvstore will not load. The same recovery
options apply.
Mismatched metadata versions
If super->version is greater than
SPDK_BS_VERSION (a newer on-disk format than this
SPDK knows), the validate function returns -EILSEQ
and the load fails. This is "future-proof" — running an old
SPDK against a newer lvstore is refused. The fix is to upgrade
SPDK. The reverse case (old version on a newer SPDK) is
handled by migration: SPDK may write the new format on first
metadata sync.
Partially written cluster
A torn write to a data cluster is not a metadata problem. The metadata may be consistent; the data cluster just contains garbage in the part that wasn't written. The next read of that cluster will return whatever was on the disk. This is why snapshots (which use copy-on-write) are useful for crash-consistency: the snapshot's metadata is written before the data is touched.
Too-small backing bdev
super->size must be less than or equal to the
backing bdev's size. If you replaced the bdev with a smaller
one (e.g. swapped a 1 TiB NVMe for a 500 GiB one without
re-partitioning), the validate function at line 1853 will
reject it. The lvstore won't load. The fix is to grow the
partition on the new device or restore from backup.
bstype mismatch
If the super block's bstype is not
"LVOLSTORE" and not all-zeroes, the load returns
-ENXIO. This is what happens if you accidentally
point a generic blobstore RPC at a non-lvstore blobstore
(e.g. a PREFIX bdev that lives in the same SPDK instance).
Same answer: don't load it.
Two lvstores with the same name on two different partitions
The examine path is happy to load both: each one has a different UUID, so they can coexist. But the RPC layer's name-based lookup (vbdev_get_lvol_store_by_name) returns whichever one comes first in the list, and that order is not defined. If you must address an lvstore from automation, use the UUID.
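The ambiguity is easy to reproduce with a toy model of the list scan. This is an illustration only: the Lvstore dataclass and both lookup functions are hypothetical stand-ins for the in-memory list and its C lookups, and the UUIDs are made up.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Lvstore:
    uuid: str
    name: str


def find_by_name(stores: List[Lvstore], name: str) -> Optional[Lvstore]:
    """Linear scan; returns the first match, so the result depends on list order."""
    for store in stores:
        if store.name == name:
            return store
    return None


def find_by_uuid(stores: List[Lvstore], uuid: str) -> Optional[Lvstore]:
    """Linear scan; a UUID matches at most one lvstore."""
    for store in stores:
        if store.uuid == uuid:
            return store
    return None


a = Lvstore(uuid="aaaaaaaa-0000-0000-0000-000000000001", name="lvs0")
b = Lvstore(uuid="bbbbbbbb-0000-0000-0000-000000000002", name="lvs0")

# The same name resolves differently depending on which lvstore loaded first.
first_order = find_by_name([a, b], "lvs0")
other_order = find_by_name([b, a], "lvs0")
```

Here first_order is a and other_order is b, while find_by_uuid returns the same lvstore in either ordering.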
What to take away
An lvstore is a cluster allocator plus a metadata area, both living on a single backing bdev. The metadata area starts with a 4 KiB super block that describes the layout, signed with "SPDKBLOB" and protected by a CRC32C. A load fails closed if any of the five sanity checks (version, signature, CRC, bstype, size) trips. The total capacity is TotalDataClusters * ClusterSize; the free space is FreeClusters * ClusterSize. Both cluster counts flow from SPDK to diskengine across the bdev_lvol_get_lvstores RPC, and diskengine's state-verification path multiplies them by the cluster size client-side.
The next page, 5.2 — Thin provisioning + snapshots, is about the children of the lvstore: the lvols themselves, and how thin provisioning and copy-on-write change the picture.