Layer 5 · lvol

The slow write, the fast write, and why they differ.

A clone's first write to a cluster shared with its parent snapshot is a copy-on-write: the cluster is allocated from the free pool, the user's data is written to the new cluster, and the clone's extent table is updated to point at the new cluster. From the clone's perspective, the cluster now has its own copy. From the snapshot's perspective, nothing changed. This page is the inside of that operation: the structures, the metadata writes, and the failure modes.

~15 min read · 2 diagrams · prerequisites: 5.2 thin + snapshots · 5.1 lvstore + metadata

On this page

The CoW data structure: the extent table
What happens on first write to a shared cluster
What happens on read of an unwritten cluster
The metadata write: a small md_page
Random-write performance: why snapshots are slower
CoW and delete: refcounting clusters
Inflate, decouple, and the snapshot-merge story
Edge cases & what trips people up

The CoW data structure: the extent table

The CoW machinery is built on a single data structure: the blob's extent table. The extent table maps the lvol's logical cluster indices to physical cluster indices in the lvstore. A "0" entry in the table means "unallocated" (the cluster doesn't have a backing cluster yet). A non-zero entry means "this logical cluster is backed by physical cluster N."

The extent table is run-length-encoded (RLE) on disk. The on-disk descriptor is SPDK_MD_DESCRIPTOR_TYPE_EXTENT_RLE ( lib/blob/blobstore.h:283 ) — an array of (cluster_idx, length) pairs. A snapshot's extent table is a copy of the parent's. A clone's extent table is initially empty; it grows as CoW happens.

spdk_v26_01_migration/lib/blob/blobstore.h · lines 306-314 EXTENT_RLE descriptor — the CoW building block

struct spdk_blob_md_descriptor_extent_rle {
    uint8_t     type;
    uint32_t    length;

    struct {
        uint32_t    cluster_idx;     /* start cluster, or 0 if unallocated */
        uint32_t    length;          /* in units of clusters */
    } extents[0];
};

A run of length N with cluster_idx 0 means "N unallocated clusters in a row." A run with cluster_idx > 0 means "N allocated clusters starting at physical cluster cluster_idx." A fresh snapshot of a 1 TiB lvol on a 4 MiB cluster lvstore covers 262,144 (256K) logical clusters (1 TiB / 4 MiB), but because physically contiguous clusters collapse into a single run, the encoded table shrinks to a few thousand descriptors if the original is fully allocated.
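The run semantics above can be sketched as a lookup that walks the runs until it reaches the one covering the requested logical cluster. This is a minimal illustration, not the blobstore's code: `struct extent_run`, `rle_lookup`, and `clusters_for` are hypothetical names introduced here, and the sketch assumes the runs have already been read out of the descriptor into an array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One RLE run, mirroring the (cluster_idx, length) pair in the descriptor.
 * Hypothetical in-memory form, not an SPDK type. */
struct extent_run {
    uint32_t cluster_idx; /* first physical cluster, or 0 if unallocated */
    uint32_t length;      /* number of clusters in the run */
};

/* Number of clusters needed to cover size_bytes (rounds up). */
static uint64_t clusters_for(uint64_t size_bytes, uint64_t cluster_bytes)
{
    assert(cluster_bytes > 0);
    return (size_bytes + cluster_bytes - 1) / cluster_bytes;
}

/*
 * Resolve a logical cluster index by walking the runs.
 * Returns the physical cluster, or 0 if the cluster is unallocated
 * or lies beyond the last run.
 */
static uint32_t rle_lookup(const struct extent_run *runs, size_t num_runs,
                           uint64_t logical)
{
    uint64_t first = 0; /* logical index where the current run starts */

    for (size_t i = 0; i < num_runs; i++) {
        if (logical < first + runs[i].length) {
            if (runs[i].cluster_idx == 0) {
                return 0; /* inside an unallocated run */
            }
            return runs[i].cluster_idx + (uint32_t)(logical - first);
        }
        first += runs[i].length;
    }
    return 0;
}
```

With the runs (10, 3), (0, 2), (50, 1), logical clusters 0..2 resolve to physical 10..12, logical 3 and 4 are unallocated, and logical 5 resolves to physical 50. `clusters_for(1 TiB, 4 MiB)` gives the 262,144 figure used above.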

For very large lvols, the extent table itself doesn't fit in one metadata page. The blobstore handles this with two more descriptor types: SPDK_MD_DESCRIPTOR_TYPE_EXTENT_TABLE ( lib/blob/blobstore.h:288 ) and SPDK_MD_DESCRIPTOR_TYPE_EXTENT_PAGE ( lib/blob/blobstore.h:293 ). The table points to multiple extent pages; each extent page holds a chunk of the cluster index array. The choice of RLE vs. extent_table is a per-blob decision and is detected automatically based on size (see use_extent_table in the blob struct).
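The two-level layout can be pictured as a divide-and-modulo lookup: the table selects an extent page, and the page holds the cluster index. The sketch below is an assumption-laden illustration only. `ENTRIES_PER_EXTENT_PAGE`, `struct extent_page`, `struct extent_table`, and `table_lookup` are hypothetical, and the page capacity is set to 4 purely so the example is easy to follow; the real on-disk descriptors are laid out differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical page capacity, chosen small for readability. */
#define ENTRIES_PER_EXTENT_PAGE 4u

/* One extent page: a fixed-size chunk of the cluster index array. */
struct extent_page {
    uint32_t cluster_idx[ENTRIES_PER_EXTENT_PAGE]; /* 0 = unallocated */
};

/* The table: one slot per extent page; NULL means the page was never written. */
struct extent_table {
    const struct extent_page **pages;
    size_t num_pages;
};

/* Return the physical cluster for a logical index, or 0 if unallocated. */
static uint32_t table_lookup(const struct extent_table *t, uint64_t logical)
{
    assert(t != NULL);

    uint64_t page_no = logical / ENTRIES_PER_EXTENT_PAGE; /* which page */
    uint64_t slot = logical % ENTRIES_PER_EXTENT_PAGE;    /* entry in page */

    if (page_no >= t->num_pages || t->pages[page_no] == NULL) {
        return 0;
    }
    return t->pages[page_no]->cluster_idx[slot];
}
```

The point of the indirection is that a CoW on one cluster only dirties the extent page holding that entry, not the whole index array.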

What happens on first write to a shared cluster

The bdev layer's lvol_write function ( module/bdev/lvol/vbdev_lvol.c:941 ) hands the request to the blobstore. The blobstore's spdk_blob_io_writev_ext is the actual workhorse. For an unallocated cluster in a thin blob, the path is:

STEP 01 · Compute cluster index: from offset_blocks / pages_per_cluster

STEP 02 · Check extent table: is the target cluster allocated?

STEP 03 · If not allocated: request a free cluster from the blobstore pool

STEP 04 · Persist the user's data: DMA into the new cluster

STEP 05 · Update the extent table: set the cluster_idx for this logical cluster

STEP 06 · Persist the metadata: small md_page write
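Steps 01 and 02 are plain arithmetic plus a table check, sketched below. The helper names (`cluster_of`, `offset_in_cluster`, `is_allocated`) are hypothetical, and the sketch assumes the offset and `pages_per_cluster` are expressed in the same unit; with 4 KiB pages and 4 MiB clusters that would be 1024 pages per cluster.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* STEP 01: which logical cluster does this offset fall in? */
static uint64_t cluster_of(uint64_t offset_blocks, uint64_t pages_per_cluster)
{
    assert(pages_per_cluster > 0);
    return offset_blocks / pages_per_cluster;
}

/* Position of the I/O inside its cluster, in the same unit as the offset. */
static uint64_t offset_in_cluster(uint64_t offset_blocks,
                                  uint64_t pages_per_cluster)
{
    assert(pages_per_cluster > 0);
    return offset_blocks % pages_per_cluster;
}

/* STEP 02: a zero entry (or an index past the table) means "unallocated". */
static bool is_allocated(const uint32_t *table, uint64_t num_clusters,
                         uint64_t cluster)
{
    return cluster < num_clusters && table[cluster] != 0;
}
```

For example, offset 1030 with 1024 pages per cluster lands in logical cluster 1 at position 6; whether steps 03 to 06 run at all depends on what `is_allocated` says about cluster 1.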

The first write to a shared cluster (i.e. a cluster that the clone is reading from its parent snapshot) takes a slightly different path. The clone's own extent table entry is still 0, because the cluster it has been reading belongs to the snapshot and is reached through the parent link, but the act of writing means the clone needs its own copy. The sequence:

spdk_v26_01_migration/lib/blob/blobstore.c · lines 4535-4590 spdk_bs_allocate_cluster — the per-cluster CoW

This is the function that actually allocates a new cluster for the clone. The walk through the in-memory state:

if (is_allocated) {
    /* Already has its own cluster — just write to it. */
    if (cluster != NULL) *cluster = ...;
} else {
    /* Need to allocate. Pick a free cluster from the pool. */
    if (bs->num_free_clusters == 0) {
        /* Pool empty — fail. */
        return -ENOSPC;
    }
    bs->num_free_clusters--;
    cluster_no = spdk_bit_pool_allocate(bs->used_clusters);
    if (cluster_no == SPDK_BIT_POOL_INVALID_BIT) {
        return -ENOMEM;
    }
}

Three things matter here:

is_allocated is the check on the clone's extent table. If the clone already has a private cluster for this logical index, no CoW is needed — just write to the existing cluster.
num_free_clusters is the lvstore-wide free counter. It's decremented before the cluster is used. If it's 0, the write fails with -ENOSPC. This is the "lvstore is full" error.
spdk_bit_pool_allocate returns a physical cluster index. The index is then written to the clone's extent table (or, more precisely, a metadata page that contains the extent table is queued for write).

Once the cluster is allocated, the user's data is DMA'd to it. Then the metadata update: the clone's blob md_page that contains the affected extent table entry is queued for a write. The metadata write is a single 4 KiB page write to the metadata area. After the metadata write completes, the I/O is reported as complete.
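The ordering described above (check the clone's table, take a cluster from the pool, write data, then record the mapping) can be modelled in memory. The sketch below is a toy, not the blobstore: `struct toy_store`, `struct toy_blob`, `toy_store_init`, and `toy_write_cluster` are hypothetical, the bit pool is a plain bool array, and the data and metadata writes are reduced to a comment.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_CLUSTERS 8u /* hypothetical store size */

struct toy_store {
    bool     used[TOY_CLUSTERS]; /* stand-in for the used_clusters bit pool */
    uint32_t num_free_clusters;
};

struct toy_blob {
    uint32_t table[TOY_CLUSTERS]; /* logical -> physical, 0 = unallocated */
};

static void toy_store_init(struct toy_store *s)
{
    for (uint32_t i = 0; i < TOY_CLUSTERS; i++) {
        s->used[i] = false;
    }
    s->used[0] = true; /* reserve 0 so it can mean "unallocated" */
    s->num_free_clusters = TOY_CLUSTERS - 1;
}

/* Returns 0 and stores the physical cluster in *out, or -ENOSPC. */
static int toy_write_cluster(struct toy_store *s, struct toy_blob *b,
                             uint32_t logical, uint32_t *out)
{
    assert(logical < TOY_CLUSTERS);

    if (b->table[logical] != 0) {
        *out = b->table[logical]; /* already private: no allocation */
        return 0;
    }
    if (s->num_free_clusters == 0) {
        return -ENOSPC; /* the "lvstore is full" case */
    }
    for (uint32_t i = 1; i < TOY_CLUSTERS; i++) {
        if (!s->used[i]) {
            s->used[i] = true;
            s->num_free_clusters--;
            /* Real path: data write to cluster i, then the md_page write. */
            b->table[logical] = i;
            *out = i;
            return 0;
        }
    }
    return -ENOSPC;
}
```

A second write to the same logical cluster returns the same physical cluster and leaves the free counter untouched, which is the "subsequent writes are cheap" behaviour in miniature.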

What happens on read of an unwritten cluster

A read of a cluster that the clone has never written to falls through to the parent snapshot. The blobstore handles this in spdk_blob_io_readv_ext: if the extent table entry is 0 (unallocated), the read is forwarded to the parent. The parent's data is read instead.

In other words, the clone is a sparse view of the parent. Reads from unallocated regions of the clone read from the parent. Reads from allocated regions of the clone read from the clone's own cluster. The framework doesn't have to know this is happening — it's all internal to the blobstore.
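The fall-through rule can be sketched as a small resolver: use the clone's own entry if it is non-zero, otherwise ask the parent. The types and the function `resolve_read` are hypothetical. Walking more than one level of parents, and treating "nobody has it" as a zero-filled read, are assumptions of this sketch rather than statements about the blobstore's internals.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory view of a blob and its parent link. */
struct view_blob {
    const uint32_t         *table;        /* logical -> physical, 0 = unallocated */
    uint64_t                num_clusters;
    const struct view_blob *parent;       /* NULL if there is no parent */
};

/*
 * Return the physical cluster a read of `logical` would be served from,
 * or 0 if no blob in the chain has the cluster (assumed to read as zeros).
 */
static uint32_t resolve_read(const struct view_blob *b, uint64_t logical)
{
    while (b != NULL) {
        if (logical < b->num_clusters && b->table[logical] != 0) {
            return b->table[logical];
        }
        b = b->parent; /* unallocated here: fall through */
    }
    return 0;
}
```

With a snapshot table of {1, 2, 3} and a clone table of {0, 100, 0}, the clone's reads of logical 0 and 2 resolve to the snapshot's clusters 1 and 3, while logical 1 resolves to the clone's private cluster 100.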

flowchart TB
subgraph Before["Before the first write to cluster 7"]
  SNAP1["snapshot 'snap1'
extents: [1..256]"]
  CL1["clone 'cl1'
extents: [empty]
all reads fall through to snap1"]
  SNAP1 -.->|cluster 7| C7A["physical cluster 7
(shared, read-only)"]
  CL1 -.->|read of cluster 7| C7A
end

subgraph After["After first write to cluster 7"]
  SNAP2["snapshot 'snap1'
extents: [1..256]
(unchanged)"]
  CL2["clone 'cl1'
extents: [..7..]
cluster 7 now own"]
  SNAP2 -.->|cluster 7| C7B["physical cluster 7
(shared, read-only)"]
  CL2 -.->|cluster 7| C7NEW["physical cluster 100
(cl1's private copy)"]
end

classDef snap fill:#d6f5d6,stroke:#2a6f2a;
classDef clone fill:#f5d6e0,stroke:#8a1c4f;
classDef shared fill:#cfe1ff,stroke:#1c4f8a;
classDef own fill:#f5e6c8,stroke:#a17f1a;
class SNAP1,SNAP2 snap
class CL1,CL2 clone
class C7A,C7B shared
class C7NEW own

fig. 1 · a clone, before and after one CoW

fig. 1 Before the CoW, the clone has no private copy of cluster 7 and reads from the snapshot. After the CoW, the clone owns physical cluster 100 (allocated from the free pool) and reads/writes go there. The snapshot is unchanged.

The metadata write: a small md_page

The metadata update for a CoW is tiny — typically a single 4 KiB md_page. The page contains the affected blob's metadata chain, of which the extent table is one descriptor. The exact size depends on the extent table's layout: a small lvol may have its entire extent table in one md_page; a large lvol with an extent_table has the table in one md_page and the actual extents in many extent_pages.

The metadata write is part of the bdev_io's lifetime. The blobstore doesn't return success to the bdev layer until both the data write and the metadata write have completed. The bdev layer doesn't return success to the original submitter until the blobstore's callback fires. The full chain is described in lib/blob/blobstore.c:4535 .

In diskengine's free-bytes sync, this matters for one reason: a CoW that fails partway through (data write succeeds, metadata write fails) leaves the lvstore in an inconsistent state. The cluster is allocated to the clone (so FreeClusters has been decremented) but the extent table doesn't reflect the new cluster. On the next load, the cluster will be reclaimed by the used_clusters bit pool rebuild. The clean-shutdown bit in the super block is cleared in this case, so the load is forced to validate everything.

Random-write performance: why snapshots are slower

The first random write to an unallocated cluster in a regular thin lvol costs one cluster allocation, one data write, and one metadata write. The first random write to a shared cluster in a clone goes through the same three steps: the only difference is that the clone was reading that cluster from its snapshot until now, and the data DMA goes to the newly allocated private cluster. Per write, the cost is the same.

Where CoW hurts is when most of the writes are to previously-unwritten clusters. A workload that touches every cluster of a 1 TiB clone will eventually allocate 256K clusters, and every first write to a cluster pays for the allocation and the metadata page write. Subsequent writes to an already-allocated cluster skip both and are fast.

CoW and delete: refcounting clusters

A cluster is freed only when the last blob that references it is deleted. The blobstore's used_clusters bit pool is the source of truth: a cluster is in the pool iff at least one blob references it. The deletion of a blob's cluster is implemented in spdk_bs_free_cluster ( lib/blob/blobstore.c:150 ), which clears the cluster's bit in the pool.

The lvol layer's spdk_lvol_deletable ( lib/lvol/lvol.c:1870 ) checks for clones. The bdev-layer _vbdev_lvol_destroy ( module/bdev/lvol/vbdev_lvol.c:650 ) calls spdk_blob_get_clones ( module/bdev/lvol/vbdev_lvol.c:663 ) and refuses to delete if the clone count is > 1. A clone count of 1 means "this blob has no clones" — the count includes the blob itself.

Diskengine's delete path relies on this: deleting a snapshot whose children have been deleted is the same operation as deleting a regular lvol. The blobstore handles the refcounting.

Inflate, decouple, and the snapshot-merge story

Two RPCs give you a way out of the "my clone's clusters are shared with a snapshot" situation:

RPC	What it does	Cost
`bdev_lvol_inflate`	For a thin lvol, allocates a private cluster for every currently-unallocated cluster in the extent table. After this, the lvol is effectively thick (no fall-through to parent for reads).	O(size) — every unallocated cluster triggers a CoW. Can be slow for large lvols.
`bdev_lvol_decouple_parent`	Copies only the clusters that are allocated in the parent, leaves everything else unallocated, then removes the parent link. Subsequent reads of still-unallocated clusters return zeros (or unmap'd behavior), not the parent's data.	O(clusters allocated in the parent): cheaper than inflate when the parent is sparse, but not free.

Inflate is the "I want a real copy, fully independent" path. After inflate, no reads fall through; the lvol owns clusters for every index. This is what you call before a destructive operation (re-encrypt, scrub, reformat) on a clone that came from a snapshot.

Decouple is the "I want to forget the parent ever existed" path. The lvol stays thin, but its parent reference is gone: clusters the parent had allocated are copied into the lvol first, and clusters that were unallocated in the parent stay unallocated and read as zeros. This is the lighter-weight alternative to inflate when the goal is just to break the link.

There is no native "merge" or "flatten" operation that combines a snapshot's data into a single lvol. The closest equivalent is to inflate the snapshot and then delete the original — but that doubles the storage cost briefly, and the lvstore needs to have enough free clusters to hold the inflated snapshot. For an in-place merge of large snapshots, the standard pattern is: create a new empty thick lvol, copy the data over with spdk_lvol_shallow_copy or external dd over NBD, then delete the snapshot.

Edge cases & what trips people up

Concurrent writes to the same cluster in parent and child

The parent is read-only, so there are no concurrent writes to the parent. Concurrent writes to the same cluster in the child are serialized by the blobstore's per-channel lock — the lvol's spdk_io_channel is per-thread, and the blobstore ensures that at most one write to a given cluster is in flight per channel. The first write triggers CoW; subsequent writes to the same cluster hit the already-allocated cluster and don't.

CoW metadata runs out of space

The CoW needs to write the new extent table to a metadata page. If the metadata area is full, the allocation fails with -ENOSPC. The lvol's data is still consistent (the cluster was allocated, the data was written, but the metadata wasn't updated). The next load may report the lvol as inconsistent. The fix is to grow the lvstore's metadata area (spdk_lvs_grow, requires the underlying bdev to have spare space) or delete some lvols to free up metadata pages.

Read-only snapshots

A snapshot is read-only by metadata. The vbdev_lvol_io_type_supported function ( module/bdev/lvol/vbdev_lvol.c:843 ) consults spdk_blob_is_read_only to reject writes, unmaps, and write-zeroes. Reads, resets, seeks are all allowed. The framework will fail a write at submit time with SPDK_BDEV_IO_STATUS_FAILED; the request never reaches the blobstore.
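The shape of that check is a switch on the I/O type with the read-only flag deciding the mutating cases. The sketch below uses a hypothetical enum and function (`enum toy_io_type`, `toy_io_type_supported`) so that it does not depend on SPDK headers; it illustrates the rule stated above and is not a copy of vbdev_lvol_io_type_supported.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the bdev I/O types named in the text. */
enum toy_io_type {
    TOY_IO_READ,
    TOY_IO_WRITE,
    TOY_IO_UNMAP,
    TOY_IO_WRITE_ZEROES,
    TOY_IO_RESET,
    TOY_IO_SEEK
};

/* Mutating I/O is supported only when the backing blob is writable. */
static bool toy_io_type_supported(bool blob_is_read_only, enum toy_io_type type)
{
    switch (type) {
    case TOY_IO_WRITE:
    case TOY_IO_UNMAP:
    case TOY_IO_WRITE_ZEROES:
        return !blob_is_read_only;
    case TOY_IO_READ:
    case TOY_IO_RESET:
    case TOY_IO_SEEK:
        return true;
    }
    return false; /* unknown type: reject */
}
```

Because the answer depends only on the flag and the type, the rejection can happen at submit time, before any blobstore work is queued.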

Crash recovery: the clean-shutdown bit

A clean shutdown sets the clean bit in the super block to 1 ( lib/blob/blobstore.c:574 ). A crash leaves it 0. On load, the blobstore doesn't immediately fail — it just marks the lvstore as dirty and may rebuild the used_clusters bit pool from scratch. The rebuild is in bs_recover ( lib/blob/blobstore.c:9880 ) and is safe: it walks every metadata page and re-marks every cluster that's referenced by any blob.
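The idea behind the rebuild can be sketched in a few lines: forget the old bitmap, walk every blob's extent table, and mark whatever is referenced. Everything here is hypothetical (`RB_CLUSTERS`, `struct rb_blob`, `rebuild_used`), and the extent tables are assumed to be already decoded into flat arrays; the real recovery walks metadata pages on disk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RB_CLUSTERS 16u /* hypothetical store size */

/* Hypothetical decoded extent table of one blob. */
struct rb_blob {
    const uint32_t *table;        /* logical -> physical, 0 = unallocated */
    size_t          num_clusters;
};

/*
 * Rebuild the used-cluster map from the blobs' extent tables.
 * Returns the number of free clusters after the rebuild.
 */
static uint32_t rebuild_used(const struct rb_blob *blobs, size_t num_blobs,
                             bool used[RB_CLUSTERS])
{
    uint32_t free_count = 0;

    for (uint32_t i = 0; i < RB_CLUSTERS; i++) {
        used[i] = false; /* discard whatever the stale map said */
    }
    used[0] = true; /* 0 is reserved to mean "unallocated" in this sketch */

    for (size_t b = 0; b < num_blobs; b++) {
        for (size_t c = 0; c < blobs[b].num_clusters; c++) {
            uint32_t phys = blobs[b].table[c];
            if (phys != 0 && phys < RB_CLUSTERS) {
                used[phys] = true;
            }
        }
    }
    for (uint32_t i = 1; i < RB_CLUSTERS; i++) {
        if (!used[i]) {
            free_count++;
        }
    }
    return free_count;
}
```

A cluster that was claimed by an interrupted CoW but never recorded in any extent table is simply not re-marked, so it returns to the free pool; that is the reclaim behaviour described for the half-finished CoW earlier on this page.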

What is not safe is a torn metadata page. If the crash happened during a CoW metadata write, the affected page may have a torn CRC, and the load will reject the page. The blobstore's response is conservative: the affected blob is marked as having an inconsistent metadata chain, and reads of clusters owned by that blob may return errors. The safe answer is to delete the affected blob and restore from a snapshot.

Inflate of a deeply-shared clone: cluster exhaustion

Inflate of a clone of size N on an lvstore with fewer than N free clusters will fail midway. The clusters allocated so far are owned by the clone (the inflate is doing the CoW as it goes), so the lvstore is now in a state where some clusters are owned by the partially inflated clone. The inflate cannot be resumed; the only options are to delete the clone (which frees the inflated clusters) or to grow the lvstore. Diskengine's provisioning loop is aware of this; the resize path ( processInPlaceResizes:43 ) checks free cluster counts before issuing the resize.
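A pre-flight check for this case amounts to counting the clone's unallocated entries and comparing that with the free cluster count. The helpers below (`clusters_needed_for_inflate`, `inflate_fits`) are hypothetical and operate on a decoded extent table; they sketch the comparison only and say nothing about how diskengine or SPDK obtain the two numbers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Every zero entry is a cluster that inflate would have to allocate. */
static uint64_t clusters_needed_for_inflate(const uint32_t *table,
                                            size_t num_clusters)
{
    uint64_t needed = 0;

    assert(table != NULL || num_clusters == 0);
    for (size_t i = 0; i < num_clusters; i++) {
        if (table[i] == 0) {
            needed++;
        }
    }
    return needed;
}

/* True if the store has enough free clusters to finish the inflate. */
static bool inflate_fits(const uint32_t *table, size_t num_clusters,
                         uint64_t num_free_clusters)
{
    return clusters_needed_for_inflate(table, num_clusters) <= num_free_clusters;
}
```

For a clone with table {0, 100, 0, 0}, three clusters are still unallocated, so the inflate fits with three free clusters and would fail partway with two. The check is advisory: other writers can consume free clusters between the check and the inflate.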

What to take away

CoW is a per-cluster allocation + metadata update. The expensive part is the metadata page write, not the data write — the data DMA is the same cost as a regular write. A workload that touches every cluster of a clone pays a one-time metadata cost per cluster; subsequent writes are cheap. The blobstore refcounts clusters in its used_clusters bit pool; a cluster is freed only when the last blob that references it is deleted. Read-only snapshots are enforced at the bdev layer's io_type_supported check, not in the blobstore itself. Inflate and decouple are the two ways out of a shared state; there is no native merge. The next page, 5.4 — diskengine lvol operations traced, is the diskengine side: how the Go code drives all of this from outside.