Layer 5 · lvol

What's actually on the disk.

An lvol bdev is a virtual bdev on top of another bdev. The backing bdev (the malloc or NVMe disk that actually holds the bytes) is divided into fixed-size clusters, and a small region at its start is dedicated to a metadata area that the lvol subsystem owns outright. This page is about that metadata: how an lvstore is described in raw bytes, how the on-disk super block works, how a GPT partition fits in, and what happens on startup when SPDK loads the lvstore.

~12 min read · 2 diagrams · prerequisites: 4.4 hierarchy · 3.2 bdev_lvol_create end-to-end
On this page
  1. What an lvstore is, in one paragraph
  2. Cluster size, lvol size, total clusters, free clusters
  3. The on-disk metadata format: spdk_bs_super_block
  4. The GPT partition that backs the lvstore
  5. How bdev_lvol_create chooses the lvstore
  6. How the lvstore is loaded on SPDK startup
  7. What diskengine sees: FreeClusters * ClusterSize
  8. Edge cases & what trips people up

What an lvstore is, in one paragraph

An lvstore is a self-contained allocator for a single backing bdev. It owns a fixed cluster size, carves the bdev into N clusters, and maintains a metadata area at the start that records which clusters are in use, which are free, and which belong to which lvol. Every lvol — a thin or thick "logical volume" bdev — is a child of exactly one lvstore. The lvstore's on-disk image is a single, self-describing blobstore (SPDK uses the word blobstore internally; lvstore is the bdev-level name for the same object).

The in-memory struct is in include/spdk_internal/lvolstore.h:88. The on-disk mirror is a blobstore, and the blobstore's public-facing struct is in lib/blob/blobstore.h:156. We will look at both, but in different orders.

Cluster size, lvol size, total clusters, free clusters

There are exactly four numbers that describe the geometry of an lvstore. If you remember nothing else from this page, remember these four:

| Name | What it counts | Where it lives | Units |
|---|---|---|---|
| ClusterSize | Size of one cluster (the allocation unit) | lvstore metadata (immutable after init) | bytes |
| TotalDataClusters | Clusters in the lvstore that hold user data | spdk_blob_store.total_data_clusters | clusters |
| FreeClusters | Clusters not allocated to any lvol | spdk_blob_store.num_free_clusters | clusters |
| TotalClusters | Total clusters (data + metadata) | spdk_blob_store.total_clusters | clusters |

The relationship that diskengine cares about is:

lvstore_capacity_bytes = TotalDataClusters * ClusterSize
lvstore_free_bytes    = FreeClusters * ClusterSize

Both formulas appear in lvstore capacity sync:184, where diskengine re-syncs its DB view of the lvstore against SPDK's view every cycle. The three inputs come back across RPC in the bdev_lvol_get_lvstores response as total_data_clusters, free_clusters, and cluster_size. Diskengine multiplies them client-side. The cluster counts on the wire cover data clusters only: total_data_clusters excludes the clusters taken by the metadata area, so it is smaller than the on-disk total_clusters.
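The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not diskengine's code: the function name lvstore_bytes and the example values are hypothetical, and the only thing taken from the text is the three field names in the RPC response.

```python
from typing import Dict, Tuple


def lvstore_bytes(entry: Dict[str, int]) -> Tuple[int, int]:
    """Return (capacity_bytes, free_bytes) for one lvstore entry.

    `entry` is assumed to carry the three fields named in the text:
    cluster_size, total_data_clusters and free_clusters.
    """
    cluster_size = entry["cluster_size"]
    capacity = entry["total_data_clusters"] * cluster_size
    free = entry["free_clusters"] * cluster_size
    return capacity, free


# Hypothetical entry: 4 MiB clusters, 25,000 data clusters, 10,000 of them free.
example = {
    "cluster_size": 4 * 1024 * 1024,
    "total_data_clusters": 25_000,
    "free_clusters": 10_000,
}
capacity_bytes, free_bytes = lvstore_bytes(example)
```

Keeping the counts as integers until the final multiplication avoids any rounding: every byte figure is an exact multiple of the cluster size.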

The default is 4 MiB clusters, set by SPDK_LVS_OPTS_CLUSTER_SZ at include/spdk_internal/lvolstore.h:17. You can override it at lvstore create time. The cluster size must be a multiple of the backing bdev's block size (typically 512 or 4096 bytes). An lvol's size is rounded up to a whole number of clusters: an lvol that is 1 byte larger than a whole number of clusters consumes one more full cluster.
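A small sketch of the two rules in this paragraph, assuming nothing beyond what the text states. The helper names check_cluster_size and clusters_needed are illustrative and do not exist in SPDK or diskengine.

```python
DEFAULT_CLUSTER_SIZE = 4 * 1024 * 1024  # 4 MiB, the default named in the text


def check_cluster_size(cluster_size: int, block_size: int) -> None:
    """Raise ValueError unless cluster_size is a positive multiple of block_size."""
    if cluster_size <= 0 or cluster_size % block_size != 0:
        raise ValueError(
            f"cluster size {cluster_size} is not a multiple of block size {block_size}"
        )


def clusters_needed(lvol_size_bytes: int, cluster_size: int = DEFAULT_CLUSTER_SIZE) -> int:
    """Round an lvol size up to a whole number of clusters (ceiling division)."""
    return -(-lvol_size_bytes // cluster_size)


# One byte past a cluster boundary costs a whole extra cluster.
exact = clusters_needed(DEFAULT_CLUSTER_SIZE)
one_over = clusters_needed(DEFAULT_CLUSTER_SIZE + 1)
```

With the default cluster size, `exact` is 1 and `one_over` is 2, which is the "next cluster" behaviour described above.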

The on-disk metadata format: spdk_bs_super_block

The super block sits at the very start of the lvstore, in cluster 0 (the lowest LBA range on the bdev). It occupies one full 4 KiB page regardless of the cluster size: a single spdk_bs_super_block struct padded out to exactly 0x1000 bytes. The struct definition is in lib/blob/blobstore.h:407.

When SPDK starts up and tries to load the lvstore, the first thing it does is read this struct, check the signature, check the version, and check the CRC. That logic is at lib/blob/blobstore.c:1822.
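The three checks can be illustrated with a hedged Python sketch. It is not a port of the SPDK function, and it makes several assumptions that the text does not confirm: the version is a little-endian u32 immediately after the 8-byte signature, the CRC is the last 4 bytes of the 4 KiB page, the highest known version is 3, and the checksum is plain CRC-32C with the standard initial value and final XOR (SPDK's seeding may differ). The helpers check_super_page and make_page are hypothetical.

```python
import struct

PAGE_SIZE = 0x1000
SIGNATURE = b"SPDKBLOB"
MAX_VERSION = 3             # assumed value of SPDK_BS_VERSION
VERSION_OFFSET = 8          # assumption: u32 right after the signature
CRC_OFFSET = PAGE_SIZE - 4  # assumption: CRC occupies the last 4 bytes


def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli) with the standard init and final XOR."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def check_super_page(page: bytes) -> str:
    """Return "ok" or the name of the first check that fails."""
    if len(page) != PAGE_SIZE:
        return "short read"
    if page[:8] != SIGNATURE:
        return "bad signature"
    (version,) = struct.unpack_from("<I", page, VERSION_OFFSET)
    if version > MAX_VERSION:
        return "version too new"
    (stored_crc,) = struct.unpack_from("<I", page, CRC_OFFSET)
    if stored_crc != crc32c(page[:CRC_OFFSET]):
        return "bad crc"
    return "ok"


def make_page(version: int = 3) -> bytes:
    """Build a synthetic page that passes the checks (test helper only)."""
    body = (SIGNATURE + struct.pack("<I", version)).ljust(CRC_OFFSET, b"\0")
    return body + struct.pack("<I", crc32c(body))


good = make_page()
torn = bytearray(good)
torn[100] ^= 0xFF  # simulate a torn write inside the page
results = [check_super_page(good), check_super_page(bytes(torn))]
```

The checks run cheapest-first: a wrong signature is rejected before any checksum work is done, and a page that merely has a flipped byte is caught by the CRC.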

The GPT partition that backs the lvstore

On a real disk, the lvstore does not live at sector 0; it lives at the start of a GPT partition. The convention is that the partition type UUID identifies the partition as an SPDK blobstore/lvstore. On the diskengine side, the partition is created by the disk provisioning code; on the SPDK side, the lvstore's vbdev_lvs_examine_disk callback is invoked for every examined bdev, and only the ones that look like blobstores (have the SPDK signature and a matching bstype) are loaded.

Two things to know about the partition:

  1. The partition type UUID must match the convention. Otherwise tooling won't see the partition. The actual type UUID used by diskengine is set in the disk provisioning step; the convention is that the partition type UUID begins with the bytes that identify the partition as an SPDK blobstore. See disk provisioning:1 for the specific UUID used.

  2. The partition's LBA range is the lvstore's address space. The lvstore does not know about the rest of the disk. total_clusters in the super block is partition_bytes / cluster_size, and md_start is an offset from the partition's start, not the disk's.

flowchart LR
  subgraph Disk["Physical disk (NVMe namespace)"]
    subgraph P1["GPT partition (lvolstore)"]
      SB["Cluster 0: spdk_bs_super_block<br/>signature, version, CRC<br/>md_start, md_len, cluster_size, ..."]
      MD["Clusters 1..N: metadata area<br/>used_page_mask, used_cluster_mask<br/>used_blobid_mask, blob MD pages"]
      DATA["Clusters N+1..M: data clusters<br/>each cluster holds lvol data<br/>or is unallocated (free)"]
    end
    subgraph P2["Other partitions / free space"]
      OTHER["unrelated data or other bdevs"]
    end
  end
  SB --> MD --> DATA
  classDef sb fill:#f5e6c8,stroke:#a17f1a;
  classDef md fill:#cfe1ff,stroke:#1c4f8a;
  classDef data fill:#d6f5d6,stroke:#2a6f2a;
  class SB sb
  class MD md
  class DATA data

fig. 1: an lvstore's address space on a partition

fig. 1   The lvstore is one GPT partition on the disk. Inside the partition, the first cluster is the super block; the next few clusters are the metadata area; everything after that is data clusters, allocated to lvols or free.
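The address-space arithmetic from item 2 above can be sketched as follows. The text does not state the unit of md_start, so this sketch assumes it counts 4 KiB pages from the partition start; the function names and the example partition geometry are hypothetical.

```python
MD_PAGE_SIZE = 0x1000  # assumption: md_start is counted in 4 KiB pages


def total_clusters(partition_bytes: int, cluster_size: int) -> int:
    """Whole clusters that fit in the partition; any remainder is unusable."""
    return partition_bytes // cluster_size


def md_start_disk_offset(partition_first_lba: int, lba_size: int, md_start_pages: int) -> int:
    """Translate md_start (relative to the partition) into a byte offset on the whole disk."""
    partition_start_bytes = partition_first_lba * lba_size
    return partition_start_bytes + md_start_pages * MD_PAGE_SIZE


# Hypothetical layout: a 100 GiB partition starting at LBA 2048 on a 512-byte-sector disk.
clusters = total_clusters(100 * 1024 ** 3, 4 * 1024 * 1024)
offset = md_start_disk_offset(partition_first_lba=2048, lba_size=512, md_start_pages=3)
```

The lvstore itself only ever works with the partition-relative value; the translation to a whole-disk offset is something you need only when inspecting the raw device with external tools.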

How bdev_lvol_create chooses the lvstore

When diskengine calls bdev_lvol_create, the RPC payload must specify which lvstore to create the lvol in. There are two ways: by UUID or by name. They are mutually exclusive, and the helper at module/bdev/lvol/vbdev_lvol_rpc.c:42 enforces this.

Diskengine always passes the UUID (see LvstoreUUID:91 ). The UUID is the stable identifier; the name is a human-readable label that can change via bdev_lvol_rename_lvstore. For automation, always prefer the UUID.
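The "exactly one of UUID or name" rule can be sketched on the client side like this. It is an illustration of the rule only, not the SPDK helper: the function name pick_lvstore_selector, the returned tuple shape, and the example UUID are all hypothetical.

```python
from typing import Optional, Tuple


def pick_lvstore_selector(uuid: Optional[str], lvs_name: Optional[str]) -> Tuple[str, str]:
    """Return ("uuid", value) or ("name", value); reject both-set and neither-set."""
    if uuid is None and lvs_name is None:
        raise ValueError("one of uuid or lvs_name must be given")
    if uuid is not None and lvs_name is not None:
        raise ValueError("uuid and lvs_name are mutually exclusive")
    if uuid is not None:
        return ("uuid", uuid)
    return ("name", lvs_name)


# Hypothetical UUID, used the way diskengine is described as calling the RPC.
selector = pick_lvstore_selector("a9959197-b5e2-4f2d-8095-251ffb6985a5", None)
```

Validating this before sending the request turns a server-side rejection into an immediate, local error.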

The two lookup functions (module/bdev/lvol/vbdev_lvol.c:549 and module/bdev/lvol/vbdev_lvol.c:561) are linear scans over the in-memory list g_spdk_lvol_pairs (a TAILQ of struct lvol_store_bdev). For an SPDK instance with 2-3 lvstores, this is a few comparisons. The list needs no lock because all lvol operations run on a single thread.
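A linear scan of this kind is short enough to model directly. The sketch below uses a Python list in place of the TAILQ and a hypothetical LvolStore record with only the two fields the lookups compare; the UUIDs are made up.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LvolStore:
    uuid: str
    name: str


def find_by_uuid(stores: List[LvolStore], uuid: str) -> Optional[LvolStore]:
    """Return the store with this UUID, or None."""
    for store in stores:
        if store.uuid == uuid:
            return store
    return None


def find_by_name(stores: List[LvolStore], name: str) -> Optional[LvolStore]:
    """Return the first store with this name, or None."""
    for store in stores:
        if store.name == name:
            return store
    return None


stores = [
    LvolStore("11111111-1111-1111-1111-111111111111", "lvs0"),
    LvolStore("22222222-2222-2222-2222-222222222222", "lvs1"),
    LvolStore("33333333-3333-3333-3333-333333333333", "lvs0"),  # duplicate name
]
by_name = find_by_name(stores, "lvs0")
by_uuid = find_by_uuid(stores, "33333333-3333-3333-3333-333333333333")
```

The name lookup stops at the first match, so with a duplicated name the result depends on list order, while the UUID lookup always identifies one specific store.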

How the lvstore is loaded on SPDK startup

When SPDK starts and the bdev framework is initialized, the lvol module's examine_disk callback runs for every bdev that has not been claimed. The callback tries to detect an lvstore on each bdev. The function is at module/bdev/lvol/vbdev_lvol.c:1754 .
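The shape of that startup pass can be sketched as a loop over bdevs. Everything here is a stand-in: bdevs are plain dicts, try_load is a caller-supplied function, and marking the bdev as claimed after a successful load is an assumption about the behaviour rather than something the text states.

```python
from typing import Callable, Dict, Iterable, List, Optional


def examine_all(
    bdevs: Iterable[Dict],
    try_load: Callable[[Dict], Optional[str]],
) -> List[str]:
    """Try to load an lvstore from every unclaimed bdev; return the names that loaded."""
    loaded = []
    for bdev in bdevs:
        if bdev["claimed"]:
            continue  # someone else owns this bdev, so do not examine it
        lvs_name = try_load(bdev)
        if lvs_name is None:
            continue  # not an lvstore (or failed validation): leave the bdev alone
        bdev["claimed"] = True  # assumption: a loaded lvstore claims its bdev
        loaded.append(lvs_name)
    return loaded


def fake_try_load(bdev: Dict) -> Optional[str]:
    """Stand-in for the real load: only the 8-byte signature is checked."""
    return bdev["lvs_name"] if bdev["head"] == b"SPDKBLOB" else None


bdevs = [
    {"name": "nvme0n1p1", "claimed": False, "head": b"SPDKBLOB", "lvs_name": "lvs0"},
    {"name": "nvme0n1p2", "claimed": False, "head": b"\0" * 8, "lvs_name": None},
    {"name": "nvme1n1", "claimed": True, "head": b"SPDKBLOB", "lvs_name": "lvs1"},
]
loaded = examine_all(bdevs, fake_try_load)
```

A failed detection is not an error for the pass as a whole: the bdev simply stays unclaimed and available to other modules.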

What diskengine sees: FreeClusters * ClusterSize

The three numbers that diskengine actually consumes from SPDK are cluster_size, total_data_clusters, and free_clusters. They come back across the bdev_lvol_get_lvstores RPC. Diskengine's view of "what's the lvstore's free space" is just the multiplication, done in spdkFree = FreeClusters * ClusterSize:184.

Why the multiplication happens client-side, not server-side: SPDK's RPC returns raw cluster counts because that is the canonical unit. Anything derived from those counts (free bytes, used bytes, percent full) is a client decision. This is also why the two multiplication sites in verifystate.go:184 and snapshotdelete.go:95 look the same: both come from the same RPC, just consumed by different parts of diskengine (reconciliation vs. post-delete cleanup).
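As an illustration of "derived values are a client decision", here is a hedged sketch that computes used bytes and percent full from the raw counts. The function name usage_summary and its result keys are hypothetical; only the three inputs come from the text.

```python
from typing import Dict


def usage_summary(cluster_size: int, total_data_clusters: int, free_clusters: int) -> Dict[str, float]:
    """Derive byte-level figures from raw cluster counts."""
    used_clusters = total_data_clusters - free_clusters
    if total_data_clusters > 0:
        percent_full = 100.0 * used_clusters / total_data_clusters
    else:
        percent_full = 0.0  # an empty or unloaded lvstore: avoid dividing by zero
    return {
        "used_bytes": used_clusters * cluster_size,
        "free_bytes": free_clusters * cluster_size,
        "percent_full": percent_full,
    }


# Hypothetical lvstore: 1,000 data clusters of 4 MiB, 250 still free.
summary = usage_summary(4 * 1024 * 1024, 1_000, 250)
```

Computing the percentage from cluster counts rather than from byte totals keeps the ratio exact and independent of the cluster size.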

Edge cases & what trips people up

Corrupt lvstore: signature mismatch

If the first 8 bytes of the partition are not "SPDKBLOB", bs_super_validate returns -EILSEQ. The examine callback bails out and the lvstore is not loaded. The bdev is still there, the partition is still there — they just don't look like an lvstore anymore. The recovery options are: reformat the partition (which destroys the lvstore) or restore from a backup.

Corrupt lvstore: bad CRC

The signature is right but the CRC isn't. This usually means a torn write — the SPDK instance died in the middle of updating the super block. The clean field would also be 0 in this case. The lvstore will not load. The same recovery options apply.

Mismatched metadata versions

If super->version is greater than SPDK_BS_VERSION (a newer on-disk format than this SPDK knows), the validate function returns -EILSEQ and the load fails. This is "future-proof" — running an old SPDK against a newer lvstore is refused. The fix is to upgrade SPDK. The reverse case (old version on a newer SPDK) is handled by migration: SPDK may write the new format on first metadata sync.

Partially written cluster

A torn write to a data cluster is not a metadata problem. The metadata may be consistent; the data cluster just contains garbage in the part that wasn't written. The next read of that cluster will return whatever was on the disk. This is why snapshots (which use copy-on-write) are useful for crash-consistency: the snapshot's metadata is written before the data is touched.

Too-small backing bdev

super->size must be less than or equal to the backing bdev's size. If you replaced the bdev with a smaller one (e.g. swapped a 1 TiB NVMe for a 500 GiB one without re-partitioning), the validate function at line 1853 will reject it. The lvstore won't load. The fix is to grow the partition on the new device or restore from backup.

bstype mismatch

If the super block's bstype is not "LVOLSTORE" and not all-zeroes, the load returns -ENXIO. This is what happens if you accidentally point a generic blobstore RPC at a non-lvstore blobstore (e.g. a PREFIX bdev that lives in the same SPDK instance). Same answer: don't load it.
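The last two cases (size and bstype) can be sketched together. This is an illustration under assumptions, not the SPDK code: bstype is modelled as a 16-byte zero-padded field, the size failure is given -EILSEQ because the text names no code for it, and the function name is hypothetical. The -ENXIO result for a foreign bstype is taken from the text.

```python
import errno

BSTYPE_LEN = 16  # assumption: bstype is a fixed 16-byte, zero-padded field
LVOL_BSTYPE = b"LVOLSTORE".ljust(BSTYPE_LEN, b"\0")
ZERO_BSTYPE = bytes(BSTYPE_LEN)


def check_bstype_and_size(super_bstype: bytes, super_size: int, bdev_size: int) -> int:
    """Return 0 if the super block may be loaded as an lvstore, else a negative errno."""
    if super_bstype not in (LVOL_BSTYPE, ZERO_BSTYPE):
        return -errno.ENXIO  # a blobstore, but not an lvstore
    if super_size > bdev_size:
        return -errno.EILSEQ  # assumption: error code not given in the text
    return 0


TIB = 1 << 40
GIB = 1 << 30
shrunk = check_bstype_and_size(LVOL_BSTYPE, TIB, 500 * GIB)  # 1 TiB lvstore, 500 GiB device
foreign = check_bstype_and_size(b"BLOBFS".ljust(BSTYPE_LEN, b"\0"), TIB, TIB)
```

Both outcomes are deliberate refusals rather than repairs: the loader never tries to shrink an lvstore or reinterpret a foreign blobstore.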

Two lvstores with the same name on two different partitions

The examine path is happy to load both: each one has a different UUID, so they can coexist. But the RPC layer's name-based lookup (vbdev_get_lvol_store_by_name) returns the first match in the list, and which of the two comes first is not defined. If you must address an lvstore from automation, use the UUID.

What to take away

An lvstore is a cluster allocator + a metadata area, both living on a single backing bdev. The metadata area starts with a 4 KiB super block that describes the layout, signed with "SPDKBLOB" and protected by a CRC32C. A load fails closed if any of the five sanity checks (version, signature, CRC, bstype, size) trips. The total capacity is TotalDataClusters * ClusterSize; the free space is FreeClusters * ClusterSize. The underlying counts flow from SPDK to diskengine across the bdev_lvol_get_lvstores RPC, and diskengine's state-verification path multiplies them client-side.

The next page, 5.2 — Thin provisioning + snapshots, is about the children of the lvstore: the lvols themselves, and how thin provisioning and copy-on-write change the picture.